## Evaluating a Classification Model

In this exercise, you will create a pipeline for a classification model, and then apply commonly used metrics to evaluate the resulting classifier.

### Prepare the Data

First, import the libraries you will need and prepare the training and test data:

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.\
        builder.\
        appName("pyspark-notebook").\
        master("spark://spark-master:7077").\
        config("spark.executor.memory", "4098m").\
        getOrCreate()

In [3]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import col

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler, StringIndexer, MinMaxScaler
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [4]:
# Load the source data
flightSchema = StructType([
  StructField("DayofMonth", IntegerType(), False),
  StructField("DayOfWeek", IntegerType(), False),
  StructField("Carrier", StringType(), False),
  StructField("OriginAirportID", StringType(), False),
  StructField("DestAirportID", StringType(), False),
  StructField("DepDelay", IntegerType(), False),
  StructField("ArrDelay", IntegerType(), False),
  StructField("Late", IntegerType(), False),
])

data = spark.read.csv('../data/flights.csv', schema=flightSchema, header=True)

# Split the data
splits = data.randomSplit([0.7, 0.3])
train = splits[0]
test = splits[1]

### Define the Pipeline and Train the Model
Now define a pipeline that indexes the categorical columns, scales the numeric column, assembles a feature vector, and trains a logistic regression model:

In [5]:
# Encode each categorical column as a numeric index
monthdayIndexer = StringIndexer(inputCol="DayofMonth", outputCol="DayofMonthIdx")
weekdayIndexer = StringIndexer(inputCol="DayOfWeek", outputCol="DayOfWeekIdx")
carrierIndexer = StringIndexer(inputCol="Carrier", outputCol="CarrierIdx")
originIndexer = StringIndexer(inputCol="OriginAirportID", outputCol="OriginAirportIdx")
destIndexer = StringIndexer(inputCol="DestAirportID", outputCol="DestAirportIdx")

# Assemble the numeric column into a vector and scale it to the range [0, 1]
numVect = VectorAssembler(inputCols=["DepDelay"], outputCol="numFeatures")
minMax = MinMaxScaler(inputCol=numVect.getOutputCol(), outputCol="normNums")

# Combine the indexed and scaled columns into a single feature vector
featVect = VectorAssembler(inputCols=["DayofMonthIdx", "DayOfWeekIdx", "CarrierIdx", "OriginAirportIdx", "DestAirportIdx", "normNums"], outputCol="features")

# Chain all stages with a logistic regression classifier and fit on the training data
lr = LogisticRegression(labelCol="Late", featuresCol="features")
pipeline = Pipeline(stages=[monthdayIndexer, weekdayIndexer, carrierIndexer, originIndexer, destIndexer, numVect, minMax, featVect, lr])
model = pipeline.fit(train)
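
To make the two preprocessing stages less opaque, the following plain-Python sketch mimics what `StringIndexer` and `MinMaxScaler` do with their default settings. It does not use Spark; the helper names `string_index` and `min_max_scale` and the sample values are hypothetical, chosen only for illustration.

```python
from collections import Counter


def string_index(values):
    """Map each distinct value to an index, most frequent value first.

    Mirrors StringIndexer's default "frequencyDesc" ordering. Ties are
    broken alphabetically here, which is an assumption of this sketch.
    """
    counts = Counter(str(v) for v in values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    mapping = {label: float(i) for i, label in enumerate(ordered)}
    return [mapping[str(v)] for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1] (MinMaxScaler's default range)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # With a constant column, the midpoint of the output range is used.
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


carriers = ["DL", "AA", "DL", "UA", "DL", "AA"]  # hypothetical sample values
dep_delays = [-5, 0, 15, 95]                     # hypothetical sample values

carrier_idx = string_index(carriers)
scaled_delays = min_max_scale(dep_delays)

print(carrier_idx)    # [0.0, 1.0, 0.0, 2.0, 0.0, 1.0]
print(scaled_delays)  # [0.0, 0.05, 0.2, 1.0]
```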

### Test the Model
Now you're ready to apply the model to the test data.

In [6]:
prediction = model.transform(test)
predicted = prediction.select("features", col("prediction").cast("Int"), col("Late").alias("trueLabel"))
predicted.show(100, truncate=False)

+---------------------------------------------+----------+---------+
|features                                     |prediction|trueLabel|
+---------------------------------------------+----------+---------+
|[25.0,2.0,10.0,1.0,51.0,0.26865671641791045] |0         |0        |
|[25.0,2.0,10.0,57.0,34.0,0.3582089552238806] |1         |1        |
|[25.0,2.0,10.0,11.0,41.0,0.23383084577114427]|0         |0        |
|[25.0,2.0,10.0,11.0,12.0,0.24378109452736318]|0         |1        |
|[25.0,2.0,10.0,18.0,16.0,0.5920398009950248] |1         |1        |
|[25.0,2.0,10.0,49.0,10.0,0.3233830845771144] |0         |0        |
|[25.0,2.0,10.0,49.0,16.0,0.2537313432835821] |0         |0        |
|[25.0,2.0,10.0,37.0,19.0,0.5671641791044776] |1         |1        |
|[25.0,2.0,10.0,37.0,22.0,0.24378109452736318]|0         |1        |
|[25.0,2.0,10.0,37.0,43.0,0.24378109452736318]|0         |0        |
|[25.0,2.0,10.0,37.0,43.0,0.5323383084577115] |1         |1        |
|[25.0,2.0,10.0,37.0,1.0,0.5572139

### Compute Confusion Matrix Metrics
Classifiers are typically evaluated by creating a *confusion matrix*, which indicates the number of:
- True Positives
- True Negatives
- False Positives
- False Negatives

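From these core measures, other evaluation metrics can be calculated. *Precision* is the fraction of predicted positives that are actually positive, TP / (TP + FP); *recall* is the fraction of actual positives that the model identifies, TP / (TP + FN).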

In [7]:
# Count each cell of the confusion matrix
tp = float(predicted.filter("prediction == 1 AND trueLabel == 1").count())
fp = float(predicted.filter("prediction == 1 AND trueLabel == 0").count())
tn = float(predicted.filter("prediction == 0 AND trueLabel == 0").count())
fn = float(predicted.filter("prediction == 0 AND trueLabel == 1").count())
metrics = spark.createDataFrame([
 ("TP", tp),
 ("FP", fp),
 ("TN", tn),
 ("FN", fn),
 ("Precision", tp / (tp + fp)),
 ("Recall", tp / (tp + fn))],["metric", "value"])
metrics.show()

+---------+------------------+
|   metric|             value|
+---------+------------------+
|       TP|           88908.0|
|       FP|            7203.0|
|       TN|           81310.0|
|       FN|           13299.0|
|Precision|0.9250554046883291|
|   Recall|0.8698817106460418|
+---------+------------------+
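
The same four counts support other common metrics. As a sketch, the plain-Python snippet below recomputes precision and recall from the counts shown in the table above and adds accuracy and F1, which are not part of the original cell. The helper name `confusion_metrics` is hypothetical.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Derive common metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Counts copied from the metrics table above.
m = confusion_metrics(tp=88908, fp=7203, tn=81310, fn=13299)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

The guards against a zero denominator matter when a model never predicts the positive class, in which case `tp + fp` is 0.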



### View the Raw Prediction and Probability
The prediction is derived from a raw prediction score, which is the model's linear output (the log-odds) before the logistic function is applied. The logistic function converts this score into a probability vector that indicates the confidence for each possible label value (in this case, 0 and 1). The label with the highest probability is selected as the prediction.
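
The link between the raw score and the probability can be checked by hand. In the output of the next cell, the two raw prediction values in each row are negatives of each other, and applying the logistic (sigmoid) function to each one yields the probability vector. The sketch below reproduces this for the first row of that output in plain Python; `sigmoid` is an illustrative helper, not a Spark function.

```python
import math


def sigmoid(z):
    """Logistic function: maps a raw score (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# rawPrediction of the first output row: [score for label 0, score for label 1]
raw_prediction = [1.7208039617102058, -1.7208039617102058]

probability = [sigmoid(z) for z in raw_prediction]
# The predicted label is the index of the larger probability.
predicted_label = probability.index(max(probability))

print(probability)      # approximately [0.848, 0.152]
print(predicted_label)  # 0
```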

In [8]:
prediction.select("rawPrediction", "probability", "prediction", col("Late").alias("trueLabel")).show(100, truncate=False)

+------------------------------------------+------------------------------------------+----------+---------+
|rawPrediction                             |probability                               |prediction|trueLabel|
+------------------------------------------+------------------------------------------+----------+---------+
|[1.7208039617102058,-1.7208039617102058]  |[0.8482323627065036,0.15176763729349646]  |0.0       |0        |
|[-0.4315931839841358,0.4315931839841358]  |[0.39374595818416447,0.6062540418158355]  |1.0       |1        |
|[2.5675991874600257,-2.5675991874600257]  |[0.9287469831321856,0.07125301686781427]  |0.0       |0        |
|[2.0144335926591417,-2.0144335926591417]  |[0.8823042028914301,0.11769579710856985]  |0.0       |1        |
|[-6.83253445628921,6.83253445628921]      |[0.0010769611027759108,0.998923038897224] |1.0       |1        |
|[0.16906153509922817,-0.16906153509922817]|[0.5421650022686558,0.4578349977313442]   |0.0       |0        |
|[2.016302615474006

Note that the results include rows where the probability for 0 (the first value in the **probability** vector) is only slightly higher than the probability for 1 (the second value in the **probability** vector). The default *discrimination threshold* (the boundary that decides whether a probability is predicted as a 1 or a 0) is 0.5, so the label with the higher probability is always predicted, no matter how close that probability is to the threshold.
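
To see how the threshold drives the predicted label, the sketch below applies two thresholds to the label-1 probabilities from the first six rows of the output above. It is plain Python rather than Spark, and `apply_threshold` is a hypothetical helper; it follows the rule that label 1 is predicted when its probability exceeds the threshold.

```python
def apply_threshold(p1_values, threshold=0.5):
    """Predict 1 where the probability of label 1 exceeds the threshold, else 0."""
    return [1 if p1 > threshold else 0 for p1 in p1_values]


# Second element of the probability vector for the first six output rows.
p1 = [0.15176763729349646, 0.6062540418158355, 0.07125301686781427,
      0.11769579710856985, 0.998923038897224, 0.4578349977313442]

default_preds = apply_threshold(p1)                 # threshold of 0.5
lenient_preds = apply_threshold(p1, threshold=0.4)  # lower threshold

print(default_preds)  # [0, 1, 0, 0, 1, 0]
print(lenient_preds)  # [0, 1, 0, 0, 1, 1]
```

Only the last row changes, because its probability of 0.458 lies between the two thresholds. In Spark itself, `LogisticRegression` accepts a `threshold` parameter (default 0.5) for this purpose; lowering it typically raises recall at the cost of precision.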

### Review the Area Under ROC
Another way to assess the performance of a classification model is to measure the area under a *receiver operating characteristic (ROC) curve* for the model. A ROC curve plots the true positive rate against the false positive rate for varying threshold values (the probability value over which a class label is predicted). The area under this curve gives an overall indication of the model's ability to separate the two classes, as a value between 0 and 1. A value of 0.5 means that a binary classification model (which predicts one of two possible labels) is no better at predicting the right class than a random 50/50 guess, and a value below 0.5 means it does worse. The **spark.ml** library includes a **BinaryClassificationEvaluator** class that you can use to compute this metric.

In [9]:
evaluator = BinaryClassificationEvaluator(labelCol="Late", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
auc = evaluator.evaluate(prediction)
print("AUC = ", auc)

AUC =  0.9480829658583104
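
To make the metric more concrete, the following plain-Python sketch sweeps the threshold from high to low over a handful of scored rows, records the true and false positive rates at each step, and integrates the resulting curve with the trapezoidal rule. The six rows are the label-1 probabilities and true labels from the first rows of the earlier output, so the result describes only those rows, not the full test set. `roc_auc` is a hypothetical helper, not part of Spark, and it is not necessarily how the evaluator computes its value internally.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("ROC AUC needs at least one row of each class")

    # Sort by descending score: lowering the threshold admits rows in this order.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    area = 0.0
    i = 0
    while i < len(ranked):
        # Rows with equal scores cross the threshold together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        tpr, fpr = tp / pos, fp / neg
        # Trapezoid between the previous ROC point and the new one.
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2
        prev_tpr, prev_fpr = tpr, fpr
        i = j
    return area


# Label-1 probabilities and true labels from the first six output rows.
scores = [0.15176763729349646, 0.6062540418158355, 0.07125301686781427,
          0.11769579710856985, 0.998923038897224, 0.4578349977313442]
labels = [0, 1, 0, 1, 1, 0]

auc_six_rows = roc_auc(labels, scores)
print(round(auc_six_rows, 4))  # 0.7778
```

The value 7/9 matches the pairwise reading of AUC: of the nine (positive, negative) pairs in these six rows, the positive row has the higher score in seven.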
