## Evaluating a Regression Model

In this exercise, you will create a pipeline for a linear regression model, and then test and evaluate the model.

### Prepare the Data

First, import the libraries you will need and prepare the training and test data:

In [1]:
import os
os.environ["HADOOP_USER_NAME"] = "spark"
os.environ["SPARK_MAJOR_VERSION"] = "2"
os.environ["SPARK_HOME"] = "/usr/hdp/current/spark2-client"
import findspark
findspark.init()
import pyspark

In [2]:
# Import Spark SQL and Spark ML libraries
from pyspark.sql.types import *
from pyspark.sql.functions import col
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

In [3]:
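# Resource settings are read when the application starts, so pass them to the
# builder rather than calling spark.conf.set() after the session exists.
spark = (SparkSession.builder
         .appName('python-regression-evaluation')
         .config('spark.executor.memory', '3g')
         .config('spark.executor.cores', '3')
         .config('spark.cores.max', '3')
         .config('spark.driver.memory', '3g')
         .getOrCreate())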

In [4]:
# Load the source data
csv = spark.read.csv('/user/maria_dev/data/flights.csv', inferSchema=True, header=True)

# Select features and label
data = csv.select("DayofMonth", "DayOfWeek", "OriginAirportID", "DestAirportID", "DepDelay", col("ArrDelay").alias("label"))

# Split the data
splits = data.randomSplit([0.7, 0.3])
train = splits[0]
test = splits[1].withColumnRenamed("label", "trueLabel")
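
`randomSplit` assigns rows to the splits at random according to the normalized weights, so the 70/30 proportions are approximate rather than exact, and the split differs between runs unless a seed is supplied. The plain-Python sketch below illustrates the idea on a list of integers. `random_split` is a hypothetical helper written for this illustration, not part of the Spark API, and Spark's own sampling differs in its implementation details.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row independently to one split, with probability
    proportional to the corresponding weight (hypothetical helper)."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. [0.7, 1.0] for weights [0.7, 0.3]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point rounding
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if x < upper:
                splits[i].append(row)
                break
    return splits

rows = list(range(10000))
train_rows, test_rows = random_split(rows, [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))
```

In Spark, the equivalent step for a repeatable split is to pass a seed, for example `data.randomSplit([0.7, 0.3], seed=42)`.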

### Define the Pipeline and Train the Model
Now define a pipeline that creates a feature vector and trains a regression model.

In [5]:
# Define the pipeline
assembler = VectorAssembler(inputCols = ["DayofMonth", "DayOfWeek", "OriginAirportID", "DestAirportID", "DepDelay"], outputCol="features")
lr = LinearRegression(labelCol="label",featuresCol="features", maxIter=10, regParam=0.3)
pipeline = Pipeline(stages=[assembler, lr])

# Train the model
pipelineModel = pipeline.fit(train)

### Test the Model
Now you're ready to apply the model to the test data.

In [6]:
prediction = pipelineModel.transform(test)
predicted = prediction.select("features", "prediction", "trueLabel")
predicted.show()

+--------------------+-------------------+---------+
|            features|         prediction|trueLabel|
+--------------------+-------------------+---------+
|[1.0,1.0,10140.0,...| -5.549643029733337|      -18|
|[1.0,1.0,10140.0,...| -5.549643029733337|      -17|
|[1.0,1.0,10140.0,...|-3.5561614822716647|      -12|
|[1.0,1.0,10140.0,...| 31.130070881777108|       41|
|[1.0,1.0,10140.0,...|-3.7635011241449643|       -5|
|[1.0,1.0,10140.0,...|-0.7732788029524564|        2|
|[1.0,1.0,10140.0,...| 18.164795897933427|       19|
|[1.0,1.0,10140.0,...| -13.73229884787929|      -13|
|[1.0,1.0,10140.0,...| -9.745335752955947|      -25|
|[1.0,1.0,10140.0,...|  -8.74859497922511|       -6|
|[1.0,1.0,10140.0,...| -5.758372658032602|       -2|
|[1.0,1.0,10140.0,...|-3.7648911105709306|      -11|
|[1.0,1.0,10140.0,...|  82.95155620401181|       68|
|[1.0,1.0,10140.0,...| -4.968508197366411|      -10|
|[1.0,1.0,10140.0,...| 11.976084956057802|       13|
|[1.0,1.0,10140.0,...|-11.963068443806842|      -17|
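
Each value in the **prediction** column is the fitted model's intercept plus the dot product of its coefficients with the row's feature vector. The sketch below shows that calculation in plain Python. The `assemble` and `predict` helpers, the sample row, and the coefficient and intercept values are all made up for illustration; they are not the values learned in this exercise.

```python
def assemble(row, input_cols):
    """Collect the named columns of a row into one feature list,
    mimicking what a vector-assembly step produces."""
    return [float(row[c]) for c in input_cols]

def predict(features, coefficients, intercept):
    """Linear model: intercept + sum of coefficient * feature."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))

input_cols = ["DayofMonth", "DayOfWeek", "OriginAirportID", "DestAirportID", "DepDelay"]

# Illustrative row and parameters (assumed values, not taken from the trained model)
row = {"DayofMonth": 1, "DayOfWeek": 1, "OriginAirportID": 10140,
       "DestAirportID": 10397, "DepDelay": 5}
coefficients = [0.0, 0.0, 0.0, 0.0, 1.0]
intercept = -4.0

features = assemble(row, input_cols)
print(features)
print(predict(features, coefficients, intercept))
```

To inspect the parameters that were actually learned, the last stage of the fitted pipeline model is the linear regression model, which exposes `coefficients` and `intercept` attributes.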

### Examine the Predicted and Actual Values
You can plot the predicted values against the actual values to see how accurately the model has predicted. In a perfect model, the scatter plot would form a diagonal line, with each predicted value identical to the actual value; in practice, some variance is to be expected.
Run the cells below to create a temporary view from the **predicted** DataFrame and then retrieve the predicted and actual label values using SQL. If your notebook environment offers chart output, you can then display the results as a scatter plot, specifying **-** as the function to show the unaggregated values.

In [7]:
predicted.createOrReplaceTempView("regressionPredictions")

In [9]:
spark.sql("SELECT trueLabel, prediction FROM regressionPredictions").show()

+---------+-------------------+
|trueLabel|         prediction|
+---------+-------------------+
|      -18| -5.549643029733337|
|      -17| -5.549643029733337|
|      -12|-3.5561614822716647|
|       41| 31.130070881777108|
|       -5|-3.7635011241449643|
|        2|-0.7732788029524564|
|       19| 18.164795897933427|
|      -13| -13.73229884787929|
|      -25| -9.745335752955947|
|       -6|  -8.74859497922511|
|       -2| -5.758372658032602|
|      -11|-3.7648911105709306|
|       68|  82.95155620401181|
|      -10| -4.968508197366411|
|       13| 11.976084956057802|
|      -17|-11.963068443806842|
|      -16| -8.972846122614333|
|      -21| 0.9945616146940268|
|      812|  831.2796261324804|
|        9| -8.025218202600971|
+---------+-------------------+
only showing top 20 rows
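
Without a plotting tool, you can still get a feel for how far the points fall from the diagonal by looking at the residuals (prediction minus actual value). The sketch below does this in plain Python for the 20 rows shown above, with predictions rounded to two decimal places. The variable names and the 10-minute threshold are illustrative choices.

```python
# (trueLabel, prediction) pairs copied from the output above,
# with predictions rounded to two decimal places
rows = [
    (-18, -5.55), (-17, -5.55), (-12, -3.56), (41, 31.13), (-5, -3.76),
    (2, -0.77), (19, 18.16), (-13, -13.73), (-25, -9.75), (-6, -8.75),
    (-2, -5.76), (-11, -3.76), (68, 82.95), (-10, -4.97), (13, 11.98),
    (-17, -11.96), (-16, -8.97), (-21, 0.99), (812, 831.28), (9, -8.03),
]

# Residual = prediction - actual; zero means the point lies on the diagonal
residuals = [(actual, pred, pred - actual) for actual, pred in rows]

# Row where the prediction is furthest from the actual value
worst = max(residuals, key=lambda r: abs(r[2]))
print("Largest miss: actual=%d, predicted=%.2f, residual=%.2f" % worst)

# How many predictions fall within 10 minutes of the actual delay
within_10 = sum(1 for _, _, res in residuals if abs(res) <= 10)
print("%d of %d predictions are within 10 minutes" % (within_10, len(rows)))
```

A sample of 20 rows is too small to judge the model as a whole, which is why the next step summarizes the error over the entire test set.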



### Retrieve the Root Mean Square Error (RMSE)
There are a number of metrics used to measure the difference between predicted and actual values. Of these, the root mean square error (RMSE) is commonly used because it is expressed in the same units as the predicted and actual values. In this case, the RMSE indicates the typical size, in minutes, of the error between predicted and actual flight delay values, with large errors weighted more heavily than in a simple average. You can use the **RegressionEvaluator** class to retrieve the RMSE.


In [10]:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(labelCol="trueLabel", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(prediction)
print("Root Mean Square Error (RMSE): {}".format(rmse))

Root Mean Square Error (RMSE): 13.1789860356
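
To make the metric concrete, the sketch below computes RMSE by hand in plain Python, alongside the mean absolute error (MAE) for comparison. The two helper functions and the sample values are illustrative assumptions; they are not part of Spark and will not reproduce the figure above, which is computed over the whole test set.

```python
import math

def root_mean_square_error(actual, predicted):
    """Square root of the mean of the squared errors."""
    squared = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mean_absolute_error(actual, predicted):
    """Mean of the absolute errors."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)

# Illustrative values (assumed, in minutes): three small errors and one large one
actual = [10.0, 0.0, -5.0, 20.0]
predicted = [12.0, -2.0, -3.0, 40.0]

print("RMSE:", root_mean_square_error(actual, predicted))
print("MAE: ", mean_absolute_error(actual, predicted))
```

Because the errors are squared before averaging, the single 20-minute miss pulls the RMSE well above the MAE. This sensitivity to large errors is worth keeping in mind when a few extreme delays are present in the data.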
