## Tuning Model Parameters

In this exercise, you will optimise the parameters for a classification model.

### Prepare the Data

First, import the libraries you will need and prepare the training and test data:

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.\
        builder.\
        appName("pyspark-notebook").\
        master("spark://spark-master:7077").\
        config("spark.executor.memory", "4098m").\
        getOrCreate()

In [2]:
# Import Spark SQL and Spark ML libraries
from pyspark.sql.types import *
from pyspark.sql.functions import *

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler, StringIndexer, MinMaxScaler
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Define the schema for the source data
flightSchema = StructType([
  StructField("DayofMonth", IntegerType(), False),
  StructField("DayOfWeek", IntegerType(), False),
  StructField("Carrier", StringType(), False),
  StructField("OriginAirportID", StringType(), False),
  StructField("DestAirportID", StringType(), False),
  StructField("DepDelay", IntegerType(), False),
  StructField("ArrDelay", IntegerType(), False),
  StructField("Late", IntegerType(), False),
])

In [3]:
data = spark.read.csv('../data/flights.csv', schema=flightSchema, header=True)
data = data.select("DayofMonth", "DayOfWeek", "Carrier", "OriginAirportID", "DestAirportID", "DepDelay", col("Late").alias("label"))

# Split the data
splits = data.randomSplit([0.7, 0.3])
train = splits[0]
test = splits[1]
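
A weighted random split assigns rows independently, so the 70/30 proportions are approximate rather than exact. The following plain-Python sketch illustrates the idea with one uniform draw per row. The helper `random_split` and the 10,000-row input are hypothetical stand-ins for illustration, not Spark's implementation.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to one split by drawing a uniform number per row."""
    rng = random.Random(seed)
    total = float(sum(weights))

    # Cumulative upper bound of each split's slice of [0, 1).
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point shortfall.
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


# Hypothetical data: 10,000 row ids standing in for the flight records.
train_rows, test_rows = random_split(range(10000), [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))
```

Because each row is drawn independently, the split sizes vary from run to run unless a seed is fixed. In PySpark, `randomSplit` accepts an optional `seed` argument for the same purpose.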

### Define the Pipeline
Now define a pipeline that creates a feature vector and trains a classification model.

In [4]:
# Define the pipeline
monthdayIndexer = StringIndexer(inputCol="DayofMonth", outputCol="DayofMonthIdx")
weekdayIndexer = StringIndexer(inputCol="DayOfWeek", outputCol="DayOfWeekIdx")
carrierIndexer = StringIndexer(inputCol="Carrier", outputCol="CarrierIdx")
originIndexer = StringIndexer(inputCol="OriginAirportID", outputCol="OriginAirportIdx")
destIndexer = StringIndexer(inputCol="DestAirportID", outputCol="DestAirportIdx")
numVect = VectorAssembler(inputCols = ["DepDelay"], outputCol="numFeatures")
minMax = MinMaxScaler(inputCol = numVect.getOutputCol(), outputCol="normNums")
featVect = VectorAssembler(inputCols=["DayofMonthIdx", "DayOfWeekIdx", "CarrierIdx", "OriginAirportIdx", "DestAirportIdx", "normNums"], outputCol="features")
lr = LogisticRegression(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[monthdayIndexer, weekdayIndexer, carrierIndexer, originIndexer, destIndexer, numVect, minMax, featVect, lr])
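
To make the two feature transformations in the pipeline more concrete, here is a minimal plain-Python sketch of what indexing and min-max scaling do to a column. The helpers `string_index` and `min_max_scale` and the sample values are hypothetical. The sketch assumes the default behaviours of `StringIndexer` (most frequent value gets index 0) and `MinMaxScaler` (output range 0 to 1), and it is not the Spark implementation.

```python
from collections import Counter


def string_index(values):
    """Map each distinct value to an index, most frequent value first."""
    counts = Counter(str(v) for v in values)
    # Sort by descending frequency, then by value so ties are deterministic.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    mapping = {v: float(i) for i, v in enumerate(ordered)}
    return [mapping[str(v)] for v in values], mapping


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; use the midpoint of the range.
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sample values, not taken from flights.csv.
carriers = ["DL", "AA", "DL", "WN", "DL", "AA"]
dep_delays = [-5, 0, 15, 35]

indexed, mapping = string_index(carriers)
scaled = min_max_scale(dep_delays)
print(mapping)
print(indexed)
print(scaled)
```

Scaling `DepDelay` this way puts it on a scale comparable to the other inputs, which is why the pipeline assembles it into its own vector and normalises it before building the final feature vector.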

### Tune Parameters
You can tune parameters to find the best model for your data. A simple way to do this is to use **TrainValidationSplit**, which fits the pipeline once for each combination of parameters in a grid built with **ParamGridBuilder**, scores each fitted model on a held-out portion of the training data, and keeps the best-performing combination.

In [5]:
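# Build a grid of 3 x 3 = 9 parameter combinations
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.1, 0.01, 0.001])
             .addGrid(lr.maxIter, [10, 5, 2])
             .build())

# Fit on 80% of the training data and validate on the remaining 20%
tvs = TrainValidationSplit(estimator=pipeline,
                           evaluator=BinaryClassificationEvaluator(),
                           estimatorParamMaps=paramGrid,
                           trainRatio=0.8)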

model = tvs.fit(train)
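
The grid search above can be pictured as a loop over every combination of the two parameter lists. The plain-Python sketch below enumerates those combinations and picks the one with the highest score. The function `toy_score` is a made-up stand-in for a validation metric, used only so the example runs without a cluster; it says nothing about which parameters are actually best for the flight data.

```python
from itertools import product

# The same candidate values used in the grid above.
reg_params = [0.1, 0.01, 0.001]
max_iters = [10, 5, 2]

# Every combination of the two lists: 3 x 3 = 9 parameter maps.
param_maps = [
    {"regParam": r, "maxIter": m} for r, m in product(reg_params, max_iters)
]


def toy_score(params):
    """Hypothetical score (higher is better); NOT a real validation metric."""
    return -abs(params["regParam"] - 0.01) + params["maxIter"] / 100


def pick_best(candidates, score):
    """Return the candidate parameter map with the highest score."""
    return max(candidates, key=score)


best = pick_best(param_maps, toy_score)
print(len(param_maps), "combinations; best under the toy score:", best)
```

In the real workflow, the score for each combination comes from the evaluator applied to the validation portion of the training data. The fitted model should expose the results through its `validationMetrics` and `bestModel` attributes.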

### Test the Model
Now you're ready to apply the model to the test data.

In [6]:
prediction = model.transform(test)
predicted = prediction.select("features", "prediction", "probability", "label")
predicted.show(100)

+--------------------+----------+--------------------+-----+
|            features|prediction|         probability|label|
+--------------------+----------+--------------------+-----+
|[25.0,2.0,10.0,57...|       1.0|[0.41948570395457...|    1|
|[25.0,2.0,10.0,11...|       0.0|[0.92543672481980...|    0|
|[25.0,2.0,10.0,11...|       1.0|[0.18739681563565...|    1|
|[25.0,2.0,10.0,18...|       1.0|[0.00151930167638...|    1|
|[25.0,2.0,10.0,18...|       0.0|[0.89235702240844...|    0|
|[25.0,2.0,10.0,8....|       1.0|[1.51662602578178...|    1|
|[25.0,2.0,10.0,48...|       0.0|[0.87728414561262...|    0|
|[25.0,2.0,10.0,37...|       0.0|[0.56474263118840...|    1|
|[25.0,2.0,10.0,37...|       1.0|[9.61106838213785...|    1|
|[25.0,2.0,10.0,37...|       0.0|[0.90740459057788...|    0|
|[25.0,2.0,10.0,37...|       1.0|[0.00983703232840...|    1|
|[25.0,2.0,10.0,37...|       0.0|[0.84896813768499...|    0|
|[25.0,2.0,10.0,37...|       1.0|[4.79947137749929...|    1|
...

### Compute Confusion Matrix Metrics
Classifiers are typically evaluated by creating a *confusion matrix*, which indicates the number of:
- True Positives
- True Negatives
- False Positives
- False Negatives

From these core measures, other evaluation metrics such as *precision* and *recall* can be calculated.

In [7]:
tp = float(predicted.filter("prediction == 1.0 AND label == 1").count())
fp = float(predicted.filter("prediction == 1.0 AND label == 0").count())
tn = float(predicted.filter("prediction == 0.0 AND label == 0").count())
fn = float(predicted.filter("prediction == 0.0 AND label == 1").count())
metrics = spark.createDataFrame([
 ("TP", tp),
 ("FP", fp),
 ("TN", tn),
 ("FN", fn),
 ("Precision", tp / (tp + fp)),
 ("Recall", tp / (tp + fn))],["metric", "value"])
metrics.show()

+---------+------------------+
|   metric|             value|
+---------+------------------+
|       TP|           89510.0|
|       FP|            7470.0|
|       TN|           81839.0|
|       FN|           13142.0|
|Precision|0.9229738090327902|
|   Recall|0.8719752172388263|
+---------+------------------+
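
Other common metrics follow from the same four counts. As a quick sketch, the plain-Python snippet below takes the counts from the output above and derives accuracy and the F1 score alongside precision and recall. The variable names are illustrative.

```python
# Counts copied from the metrics table above.
tp, fp, tn, fn = 89510.0, 7470.0, 81839.0, 13142.0

precision = tp / (tp + fp)                           # share of predicted-late flights that were late
recall = tp / (tp + fn)                              # share of late flights that were caught
accuracy = (tp + tn) / (tp + fp + tn + fn)           # share of all predictions that were correct
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall

for name, value in [("Precision", precision), ("Recall", recall),
                    ("Accuracy", accuracy), ("F1", f1)]:
    print(f"{name:<10}{value:.4f}")
```

Precision and recall pull in different directions, so F1 is a convenient single number when both false positives and false negatives matter.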



### Review the Area Under ROC
Another way to assess the performance of a classification model is to measure the area under the ROC curve for the model. The spark.ml library includes a **BinaryClassificationEvaluator** class that you can use to compute this.

In [8]:
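# Note: rawPredictionCol="prediction" scores the hard 0/1 predictions, so the ROC
# curve has a single operating point. Pass rawPredictionCol="rawPrediction" to
# evaluate the ranking across all thresholds; the AUC value will then differ.
evaluator = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="prediction", metricName="areaUnderROC")
auc = evaluator.evaluate(prediction)
print("AUC = ", auc)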

AUC =  0.894166515560483
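
Because the evaluator above reads the `prediction` column, which holds hard 0/1 labels, the ROC curve it integrates has only one interior point. The area under such a curve reduces to the average of the true positive rate and the true negative rate. The plain-Python sketch below checks this against the confusion matrix counts shown earlier, assuming those counts come from the same predictions. The variable names are illustrative.

```python
# Counts copied from the confusion matrix output earlier in this exercise.
tp, fp, tn, fn = 89510.0, 7470.0, 81839.0, 13142.0

tpr = tp / (tp + fn)  # true positive rate (recall)
fpr = fp / (fp + tn)  # false positive rate

# Area under the piecewise-linear ROC curve through (0, 0), (fpr, tpr), (1, 1):
# a triangle up to fpr plus a trapezoid from fpr to 1.
auc_single_point = 0.5 * fpr * tpr + 0.5 * (1 - fpr) * (tpr + 1)

# Algebraically this is the mean of the true positive and true negative rates.
balanced = 0.5 * (tpr + (1 - fpr))

print("AUC from one operating point =", auc_single_point)
```

The result agrees with the value reported above, which suggests that figure reflects a single decision threshold instead of the model's ranking quality across all thresholds.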
