## Evaluating a Classification Model

In this exercise, you will create a pipeline for a classification model, and then apply commonly used metrics to evaluate the resulting classifier.

### Prepare the Data

First, import the libraries you will need and prepare the training and test data:

In [1]:
import os
os.environ["HADOOP_USER_NAME"] = "spark"
os.environ["SPARK_MAJOR_VERSION"] = "2"
os.environ["SPARK_HOME"] = "/usr/hdp/current/spark2-client"
import findspark
findspark.init()
import pyspark

In [2]:
from pyspark.sql.functions import col  # only col() is used below; a wildcard import would shadow built-ins such as sum and max
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [3]:
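# Pass resource settings to the builder so they are in place when the session is created;
# calling spark.conf.set() afterwards does not resize an application that is already running.
spark = (SparkSession.builder
         .appName('python-classification-evaluation')
         .config('spark.executor.memory', '3g')
         .config('spark.executor.cores', '3')
         .config('spark.cores.max', '3')
         .config('spark.driver.memory', '3g')
         .getOrCreate())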

In [4]:
# Load the source data
csv = spark.read.csv('/user/maria_dev/data/flights.csv', inferSchema=True, header=True)

# Select features and label
data = csv.select("DayofMonth", "DayOfWeek", "OriginAirportID", "DestAirportID", "DepDelay", ((col("ArrDelay") > 15).cast("Int").alias("label")))

# Split the data
splits = data.randomSplit([0.7, 0.3])
train = splits[0]
test = splits[1].withColumnRenamed("label", "trueLabel")

### Define the Pipeline and Train the Model
Now define a pipeline that creates a feature vector and trains a classification model.

In [5]:
assembler = VectorAssembler(inputCols = ["DayofMonth", "DayOfWeek", "OriginAirportID", "DestAirportID", "DepDelay"], outputCol="features")
lr = LogisticRegression(labelCol="label",featuresCol="features",maxIter=10,regParam=0.3)
pipeline = Pipeline(stages=[assembler, lr])
model = pipeline.fit(train)

### Test the Model
Now you're ready to apply the model to the test data.

In [6]:
prediction = model.transform(test)
predicted = prediction.select("features", "prediction", "trueLabel")
predicted.show(100, truncate=False)

+-------------------------------+----------+---------+
|features                       |prediction|trueLabel|
+-------------------------------+----------+---------+
|[1.0,1.0,10140.0,10397.0,-2.0] |0.0       |0        |
|[1.0,1.0,10140.0,11259.0,-5.0] |0.0       |0        |
|[1.0,1.0,10140.0,11259.0,12.0] |0.0       |0        |
|[1.0,1.0,10140.0,11259.0,24.0] |0.0       |0        |
|[1.0,1.0,10140.0,11259.0,35.0] |0.0       |1        |
|[1.0,1.0,10140.0,11292.0,0.0]  |0.0       |0        |
|[1.0,1.0,10140.0,11292.0,3.0]  |0.0       |0        |
|[1.0,1.0,10140.0,11292.0,4.0]  |0.0       |0        |
|[1.0,1.0,10140.0,11298.0,-10.0]|0.0       |0        |
|[1.0,1.0,10140.0,11298.0,-9.0] |0.0       |0        |
|[1.0,1.0,10140.0,11298.0,-5.0] |0.0       |0        |
|[1.0,1.0,10140.0,11298.0,-4.0] |0.0       |0        |
|[1.0,1.0,10140.0,11298.0,-1.0] |0.0       |0        |
|[1.0,1.0,10140.0,11298.0,34.0] |0.0       |1        |
|[1.0,1.0,10140.0,11298.0,87.0] |0.0       |1        |
|[1.0,1.0,

### Compute Confusion Matrix Metrics
Classifiers are typically evaluated by creating a *confusion matrix*, which indicates the number of:
- True Positives
- True Negatives
- False Positives
- False Negatives

From these core measures, other evaluation metrics such as *precision* and *recall* can be calculated.

In [7]:
# Count each cell of the confusion matrix; float() keeps the divisions below from truncating in Python 2
tp = float(predicted.filter("prediction == 1.0 AND trueLabel == 1").count())
fp = float(predicted.filter("prediction == 1.0 AND trueLabel == 0").count())
tn = float(predicted.filter("prediction == 0.0 AND trueLabel == 0").count())
fn = float(predicted.filter("prediction == 0.0 AND trueLabel == 1").count())
metrics = spark.createDataFrame([
 ("TP", tp),
 ("FP", fp),
 ("TN", tn),
 ("FN", fn),
 ("Precision", tp / (tp + fp)),
 ("Recall", tp / (tp + fn))],["metric", "value"])
metrics.show()

+---------+-------------------+
|   metric|              value|
+---------+-------------------+
|       TP|            19200.0|
|       FP|               92.0|
|       TN|           648116.0|
|       FN|           142154.0|
|Precision| 0.9952311839104292|
|   Recall|0.11899302155509005|
+---------+-------------------+
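
The Precision and Recall rows above divide by `tp + fp` and `tp + fn`, which would raise a `ZeroDivisionError` if the model predicted no positives at all. A minimal sketch of a guarded helper follows. The name `precision_recall` is hypothetical and not part of Spark, and returning 0.0 for an empty denominator is one possible convention, not the only one. It is checked here against the counts from the table above.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall), using 0.0 when a denominator is zero."""
    predicted_pos = tp + fp   # everything the model labelled 1
    actual_pos = tp + fn      # everything that is truly 1
    precision = tp / float(predicted_pos) if predicted_pos else 0.0
    recall = tp / float(actual_pos) if actual_pos else 0.0
    return precision, recall

# Counts copied from the metrics table above
precision, recall = precision_recall(tp=19200, fp=92, fn=142154)
print("Precision = {}, Recall = {}".format(precision, recall))
```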



### View the Raw Prediction and Probability
The prediction is based on a raw prediction score: for logistic regression, this is the linear combination of the features (the log-odds) for each class. The logistic function converts this score into a probability vector that indicates the confidence for each possible label value (in this case, 0 and 1). With the default threshold, the value with the highest probability is selected as the prediction.

In [8]:
prediction.select("rawPrediction", "probability", "prediction", "trueLabel").show(100, truncate=False)

+------------------------------------------+----------------------------------------+----------+---------+
|rawPrediction                             |probability                             |prediction|trueLabel|
+------------------------------------------+----------------------------------------+----------+---------+
|[1.5683310065218605,-1.5683310065218605]  |[0.8275455498714597,0.17245445012854024]|0.0       |0        |
|[1.6130930368184753,-1.6130930368184753]  |[0.8338403711712089,0.16615962882879107]|0.0       |0        |
|[1.3744520862533793,-1.3744520862533793]  |[0.7980985028277349,0.20190149717226502]|0.0       |0        |
|[1.2059996505603703,-1.2059996505603703]  |[0.7695903680679447,0.23040963193205524]|0.0       |0        |
|[1.0515849178417787,-1.0515849178417787]  |[0.7410791307494629,0.2589208692505371] |0.0       |1        |
|[1.5430059307692665,-1.5430059307692665]  |[0.8239012743039459,0.17609872569605403]|0.0       |0        |
|[1.5008928218460142,-1.5008928218460

Note that the results include rows where the probability for 0 (the first value in the **probability** vector) is only slightly higher than the probability for 1 (the second value in the **probability** vector). The default *discrimination threshold* (the boundary that decides whether a probability is predicted as a 1 or a 0) is 0.5, so the label with the higher probability is always chosen, no matter how close that probability is to the threshold.
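
To see how the two columns relate, the first row of the output above can be checked by hand: applying the logistic (sigmoid) function to the first **rawPrediction** value should reproduce the first **probability** value. This is a plain-Python sketch that does not use Spark. The helper names `sigmoid` and `predict_label` are hypothetical, and the strict `>` comparison is an assumption about how the threshold is applied.

```python
import math

def sigmoid(score):
    """Logistic function: maps a raw score (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-score))

def predict_label(prob_of_1, threshold=0.5):
    """Return 1 if the probability of label 1 exceeds the threshold, else 0 (assumed rule)."""
    return 1 if prob_of_1 > threshold else 0

# First row of the output above: rawPrediction = [1.5683310065218605, -1.5683310065218605]
raw_for_0 = 1.5683310065218605
prob_of_0 = sigmoid(raw_for_0)     # compare with 0.8275455498714597 in the table
prob_of_1 = sigmoid(-raw_for_0)    # compare with 0.17245445012854024 in the table
label = predict_label(prob_of_1)   # the probability of 1 is below 0.5
print("P(0) = {}, P(1) = {}, prediction = {}".format(prob_of_0, prob_of_1, label))
```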

### Review the Area Under ROC
Another way to assess the performance of a classification model is to measure the area under the ROC curve for the model. The ROC curve plots the True Positive rate against the False Positive rate as the threshold varies. The spark.ml library includes a **BinaryClassificationEvaluator** class that you can use to compute the area under this curve.

In [9]:
evaluator = BinaryClassificationEvaluator(labelCol="trueLabel", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
aur = evaluator.evaluate(prediction)
print("AUR = {}".format(aur))  # works in both Python 2 and Python 3

AUR =  0.92234391107
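
To make the idea of "rates plotted for varying thresholds" concrete, here is a small plain-Python sketch that builds ROC points and integrates them with the trapezoidal rule. It is not how **BinaryClassificationEvaluator** is implemented internally. The function names and the toy scores and labels are hypothetical and are not taken from the flight data.

```python
def roc_points(scores, labels):
    """Return ROC points (FPR, TPR), one per distinct score used as a threshold."""
    pos = sum(labels)                 # number of true 1s
    neg = len(labels) - pos           # number of true 0s
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]             # threshold above every score: nothing predicted 1
    tp = fp = 0
    i = 0
    while i < len(ranked):
        current = ranked[i][0]
        # Lower the threshold past every row that shares this score
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / float(neg), tp / float(pos)))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical scores (probability of label 1) and true labels
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.3]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_points = roc_points(toy_scores, toy_labels)
toy_auc = area_under_curve(toy_points)
print("AUC = {}".format(toy_auc))
```

On this toy data, 8 of the 9 possible (positive, negative) pairs are ranked in the right order, so the area comes out as 8/9. An area of 1.0 would mean every positive is scored above every negative, and 0.5 corresponds to random ranking.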


### Change the Discrimination Threshold
The area under ROC (the AUR value above, more commonly abbreviated AUC) suggests a reasonably good model, but the confusion matrix metrics show a high number of False Negatives (the model predicts 0 when the true label is 1), which leads to low Recall. You can affect the way a model performs by changing its parameters. For example, as noted previously, the default discrimination threshold is 0.5. If there are a lot of False Positives, you may want to consider raising it; conversely, you can address a large number of False Negatives by lowering it.

In [10]:
lr2 = LogisticRegression(labelCol="label",featuresCol="features",maxIter=10,regParam=0.3, threshold=0.35)
pipeline2 = Pipeline(stages=[assembler, lr2])
model2 = pipeline2.fit(train)
newPrediction = model2.transform(test)
newPrediction.select("rawPrediction", "probability", "prediction", "trueLabel").show(100, truncate=False)

+------------------------------------------+----------------------------------------+----------+---------+
|rawPrediction                             |probability                             |prediction|trueLabel|
+------------------------------------------+----------------------------------------+----------+---------+
|[1.5683310065218605,-1.5683310065218605]  |[0.8275455498714597,0.17245445012854024]|0.0       |0        |
|[1.6130930368184753,-1.6130930368184753]  |[0.8338403711712089,0.16615962882879107]|0.0       |0        |
|[1.3744520862533793,-1.3744520862533793]  |[0.7980985028277349,0.20190149717226502]|0.0       |0        |
|[1.2059996505603703,-1.2059996505603703]  |[0.7695903680679447,0.23040963193205524]|0.0       |0        |
|[1.0515849178417787,-1.0515849178417787]  |[0.7410791307494629,0.2589208692505371] |0.0       |1        |
|[1.5430059307692665,-1.5430059307692665]  |[0.8239012743039459,0.17609872569605403]|0.0       |0        |
|[1.5008928218460142,-1.5008928218460

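Note that the **rawPrediction** and **probability** values themselves are unchanged, but some rows that were previously predicted as 0 are now predicted as 1: any row whose probability for 1 is above 0.35 (rather than 0.5) now gets a prediction of 1.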

In [11]:
# Recalculate confusion matrix
tp2 = float(newPrediction.filter("prediction == 1.0 AND trueLabel == 1").count())
fp2 = float(newPrediction.filter("prediction == 1.0 AND trueLabel == 0").count())
tn2 = float(newPrediction.filter("prediction == 0.0 AND trueLabel == 0").count())
fn2 = float(newPrediction.filter("prediction == 0.0 AND trueLabel == 1").count())
metrics2 = spark.createDataFrame([
 ("TP", tp2),
 ("FP", fp2),
 ("TN", tn2),
 ("FN", fn2),
 ("Precision", tp2 / (tp2 + fp2)),
 ("Recall", tp2 / (tp2 + fn2))],["metric", "value"])
metrics2.show()

+---------+------------------+
|   metric|             value|
+---------+------------------+
|       TP|           41677.0|
|       FP|             139.0|
|       TN|          648069.0|
|       FN|          119677.0|
|Precision|0.9966759135259231|
|   Recall|0.2582954249662233|
+---------+------------------+



Note that there are now more True Positives and fewer False Negatives, and Recall has improved from about 0.12 to about 0.26. By changing the discrimination threshold, the model now gets more predictions correct, though it's worth noting that the number of False Positives has also increased (from 92 to 139).
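
The same trade-off can be reproduced on a small scale without Spark. The sketch below uses made-up probabilities and labels (not the flight data) and a hypothetical helper, `confusion_counts`, that predicts 1 when the probability of 1 is strictly greater than the threshold; that comparison rule is an assumption.

```python
def confusion_counts(probs, labels, threshold):
    """Return (tp, fp, tn, fn) when label 1 is predicted for prob > threshold."""
    tp = fp = tn = fn = 0
    for prob, label in zip(probs, labels):
        predicted = 1 if prob > threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

# Hypothetical probabilities of label 1 and true labels (not the flight data)
toy_probs = [0.9, 0.6, 0.45, 0.4, 0.38, 0.2, 0.1, 0.05]
toy_truth = [1, 1, 1, 0, 1, 0, 0, 0]

at_050 = confusion_counts(toy_probs, toy_truth, threshold=0.5)
at_035 = confusion_counts(toy_probs, toy_truth, threshold=0.35)
print("threshold 0.50: TP, FP, TN, FN = {}".format(at_050))
print("threshold 0.35: TP, FP, TN, FN = {}".format(at_035))
```

In this toy example, lowering the threshold turns two False Negatives into True Positives at the cost of one new False Positive.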

In [12]:
evaluator = BinaryClassificationEvaluator(labelCol="trueLabel", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
aur = evaluator.evaluate(newPrediction)
print("AUR = {}".format(aur))  # works in both Python 2 and Python 3

AUR =  0.92234391107

Note that the area under ROC is unchanged. The ROC curve is built from the **rawPrediction** scores across all possible thresholds, so choosing a different discrimination threshold changes the predicted labels but not the curve or the area beneath it.
