## Evaluating a Regression Model

In this exercise, you will create a pipeline for a linear regression model, and then test and evaluate the model.

### Prepare the Data

First, import the libraries you will need and prepare the training and test data:

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.\
        builder.\
        appName("pyspark-notebook").\
        master("spark://spark-master:7077").\
        config("spark.executor.memory", "4098m").\
        getOrCreate()

In [2]:
# Import Spark SQL and Spark ML libraries
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import col

from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler, StringIndexer, MinMaxScaler
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator

# Load the source data
flightSchema = StructType([
  StructField("DayofMonth", IntegerType(), False),
  StructField("DayOfWeek", IntegerType(), False),
  StructField("Carrier", StringType(), False),
  StructField("OriginAirportID", StringType(), False),
  StructField("DestAirportID", StringType(), False),
  StructField("DepDelay", IntegerType(), False),
  StructField("ArrDelay", IntegerType(), False),
  StructField("Late", IntegerType(), False),
])

In [3]:
data = spark.read.csv('../data/flights.csv', schema=flightSchema, header=True)
data = data.select("DayofMonth", "DayOfWeek", "Carrier", "OriginAirportID", "DestAirportID", "DepDelay", col("ArrDelay").alias("label"))

# Split the data
splits = data.randomSplit([0.7, 0.3])
train = splits[0]
test = splits[1]

### Define the Pipeline and Train the Model
Now define a pipeline that indexes the categorical columns, scales the numeric column, assembles a feature vector, and trains a linear regression model.
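
To make the preprocessing stages less opaque, here is a minimal plain-Python sketch of the two transformations the pipeline relies on: frequency-based indexing (what `StringIndexer` does by default, with the most frequent value receiving index 0) and min-max scaling to the range 0 to 1 (the default range of `MinMaxScaler`). The helper names `string_index` and `min_max_scale` and the sample values are hypothetical, chosen only for illustration; this is not how Spark implements the stages internally.

```python
from collections import Counter


def string_index(values):
    """Map each distinct value to a float index, most frequent value first.

    Ties are broken alphabetically here; treat that as a simplification.
    """
    counts = Counter(str(v) for v in values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    mapping = {value: float(i) for i, value in enumerate(ordered)}
    return [mapping[str(v)] for v in values]


def min_max_scale(values):
    """Rescale numbers linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map it to the midpoint of the range.
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sample values, not taken from flights.csv
carriers = ["DL", "AA", "DL", "UA", "DL", "AA"]
dep_delays = [-5, 0, 15, 35]

carrier_idx = string_index(carriers)
scaled_delays = min_max_scale(dep_delays)

print(carrier_idx)    # [0.0, 1.0, 0.0, 2.0, 0.0, 1.0]
print(scaled_delays)  # [0.0, 0.125, 0.5, 1.0]
```

In the pipeline itself, the indexed columns and the scaled `DepDelay` value are then concatenated by the final `VectorAssembler` into the **features** column.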

In [4]:
# Define the pipeline
monthdayIndexer = StringIndexer(inputCol="DayofMonth", outputCol="DayofMonthIdx")
weekdayIndexer = StringIndexer(inputCol="DayOfWeek", outputCol="DayOfWeekIdx")
carrierIndexer = StringIndexer(inputCol="Carrier", outputCol="CarrierIdx")
originIndexer = StringIndexer(inputCol="OriginAirportID", outputCol="OriginAirportIdx")
destIndexer = StringIndexer(inputCol="DestAirportID", outputCol="DestAirportIdx")
numVect = VectorAssembler(inputCols = ["DepDelay"], outputCol="numFeatures")
minMax = MinMaxScaler(inputCol = numVect.getOutputCol(), outputCol="normNums")
featVect = VectorAssembler(inputCols=["DayofMonthIdx", "DayOfWeekIdx", "CarrierIdx", "OriginAirportIdx", "DestAirportIdx", "normNums"], outputCol="features")
lr = LinearRegression(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[monthdayIndexer, weekdayIndexer, carrierIndexer, originIndexer, destIndexer, numVect, minMax, featVect, lr])

# Train the model
pipelineModel = pipeline.fit(train)

### Test the Model
Now you're ready to apply the model to the test data.

In [5]:
prediction = pipelineModel.transform(test)
predicted = prediction.select("features", "prediction", "label")
predicted.show()

+--------------------+-------------------+-----+
|            features|         prediction|label|
+--------------------+-------------------+-----+
|[25.0,2.0,10.0,1....| 3.7093081991818835|    2|
|[25.0,2.0,10.0,11...| -2.644396212061494|   -2|
|[25.0,2.0,10.0,11...| 10.889733546459318|   28|
|[25.0,2.0,10.0,36...| -1.981488581717045|   -2|
|[25.0,2.0,10.0,7....| 101.49844052719371|   76|
|[25.0,2.0,10.0,48...| 14.982686451963787|    7|
|[25.0,2.0,10.0,38...| 30.259437968633364|   27|
|[25.0,2.0,10.0,38...| 140.37931163061737|  144|
|[25.0,2.0,10.0,38...| 21.907761979051344|    2|
|[25.0,2.0,10.0,38...| 5.8082190461635435|   21|
|[25.0,2.0,10.0,38...|  4.030097211164708|   -4|
|[25.0,2.0,10.0,38...|-1.3204872590478018|  -10|
|[25.0,2.0,10.0,38...| 59.893110660041984|   54|
|[25.0,2.0,10.0,38...|  74.98176623412854|   66|
|[25.0,2.0,10.0,23...| 28.837063279861724|   50|
|[25.0,2.0,10.0,3....| -1.392058532238778|   -7|
|[25.0,2.0,10.0,14...| 23.463355905779622|   29|
|[25.0,2.0,10.0,14..

### Examine the Predicted and Actual Values
You can plot the predicted values against the actual values to see how accurately the model predicts. For a perfect model, the scatter plot would form a straight diagonal line, with every predicted value identical to the actual value. In practice, some scatter around that line is expected.
Run the cells below to create a temporary view from the **predicted** DataFrame and then retrieve the predicted and actual label values using SQL. You can then display the results as a scatter plot to see how well the predicted delay correlates with the actual delay.

In [6]:
predicted.createOrReplaceTempView("regressionPredictions")

In [7]:
spark.sql("SELECT label, prediction FROM regressionPredictions").show()

+-----+-------------------+
|label|         prediction|
+-----+-------------------+
|    2| 3.7093081991818835|
|   -2| -2.644396212061494|
|   28| 10.889733546459318|
|   -2| -1.981488581717045|
|   76| 101.49844052719371|
|    7| 14.982686451963787|
|   27| 30.259437968633364|
|  144| 140.37931163061737|
|    2| 21.907761979051344|
|   21| 5.8082190461635435|
|   -4|  4.030097211164708|
|  -10|-1.3204872590478018|
|   54| 59.893110660041984|
|   66|  74.98176623412854|
|   50| 28.837063279861724|
|   -7| -1.392058532238778|
|   29| 23.463355905779622|
|   26|  4.274818204930895|
|   29|  13.06601043746997|
|  -14|-0.8359512806325142|
+-----+-------------------+
only showing top 20 rows
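
The scatter plot itself is not produced by the cells above. As a rough sketch, assuming matplotlib is installed, the following plots the 20 rows shown in the output (predictions rounded to two decimals) together with a dashed diagonal that marks a perfect prediction. The output file name `predicted_vs_actual.png` is an arbitrary choice for this example.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt

# (label, prediction) pairs copied from the output above, predictions rounded
rows = [
    (2, 3.71), (-2, -2.64), (28, 10.89), (-2, -1.98), (76, 101.50),
    (7, 14.98), (27, 30.26), (144, 140.38), (2, 21.91), (21, 5.81),
    (-4, 4.03), (-10, -1.32), (54, 59.89), (66, 74.98), (50, 28.84),
    (-7, -1.39), (29, 23.46), (26, 4.27), (29, 13.07), (-14, -0.84),
]
labels = [label for label, _ in rows]
predictions = [pred for _, pred in rows]

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(labels, predictions, alpha=0.7)

# Dashed diagonal: where every point would fall if prediction == label
lo = min(labels + predictions)
hi = max(labels + predictions)
ax.plot([lo, hi], [lo, hi], linestyle="--", color="gray", label="perfect prediction")

ax.set_xlabel("Actual arrival delay (label, minutes)")
ax.set_ylabel("Predicted arrival delay (minutes)")
ax.set_title("Predicted vs. actual arrival delay")
ax.legend()
fig.savefig("predicted_vs_actual.png")
```

Inside the notebook, you could instead collect a sample of the query result with `toPandas()` and pass its `label` and `prediction` columns to `ax.scatter`; sampling first is advisable because the full test set may be large.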



### Retrieve the Root Mean Square Error (RMSE)
Several metrics measure the difference between predicted and actual values. Of these, the root mean square error (RMSE) is commonly used: it is the square root of the mean squared difference between predicted and actual values, so it is expressed in the same units as the label. In this case, the RMSE indicates roughly how many minutes the predicted arrival delay typically differs from the actual arrival delay, with large errors weighted more heavily than small ones. You can use the **RegressionEvaluator** class to retrieve the RMSE.

In [8]:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(prediction)
print("Root Mean Square Error (RMSE):", rmse)

Root Mean Square Error (RMSE): 17.3843564168783
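
To see what the evaluator computes, the sketch below works out the RMSE by hand in plain Python for the 20 rows displayed earlier, alongside the mean absolute error (MAE) for comparison. Because it uses only those 20 rows rather than the full test set, its result will differ from the value reported by `RegressionEvaluator` above.

```python
import math

# (label, prediction) pairs copied from the 20 rows displayed earlier
rows = [
    (2, 3.7093081991818835), (-2, -2.644396212061494),
    (28, 10.889733546459318), (-2, -1.981488581717045),
    (76, 101.49844052719371), (7, 14.982686451963787),
    (27, 30.259437968633364), (144, 140.37931163061737),
    (2, 21.907761979051344), (21, 5.8082190461635435),
    (-4, 4.030097211164708), (-10, -1.3204872590478018),
    (54, 59.893110660041984), (66, 74.98176623412854),
    (50, 28.837063279861724), (-7, -1.392058532238778),
    (29, 23.463355905779622), (26, 4.274818204930895),
    (29, 13.06601043746997), (-14, -0.8359512806325142),
]

# Residual for each row: predicted delay minus actual delay, in minutes
errors = [pred - label for label, pred in rows]

# RMSE: square the residuals, average them, then take the square root
mse = sum(e * e for e in errors) / len(errors)
rmse = math.sqrt(mse)

# MAE: the plain average of the absolute residuals
mae = sum(abs(e) for e in errors) / len(errors)

print("RMSE over the sample rows:", rmse)
print("MAE over the sample rows:", mae)
```

The RMSE is never smaller than the MAE on the same data, because squaring gives large errors more weight before averaging.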
