## Using Cross Validation

In this exercise, you will use cross-validation to optimize parameters for a regression model.

### Prepare the Data

First, import the libraries you will need and prepare the training and test data:

In [1]:
import os
os.environ["HADOOP_USER_NAME"] = "spark"
os.environ["SPARK_MAJOR_VERSION"] = "2"
os.environ["SPARK_HOME"] = "/usr/hdp/current/spark2-client"
import findspark
findspark.init()
import pyspark

In [2]:
# Import Spark SQL and Spark ML libraries
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator

In [3]:
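# Resource settings are read when the application starts, so pass them to the
# builder; calling spark.conf.set() on an existing session does not resize executors.
spark = (SparkSession.builder
         .appName('python-cross-validation')
         .config('spark.executor.memory', '3g')
         .config('spark.executor.cores', '3')
         .config('spark.cores.max', '3')
         .config('spark.driver.memory', '3g')
         .getOrCreate())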

In [4]:
# Load the source data
csv = spark.read.csv('/user/maria_dev/data/flights.csv', inferSchema=True, header=True)

# Select features and label
data = csv.select("DayofMonth", "DayOfWeek", "OriginAirportID", "DestAirportID", "DepDelay", col("ArrDelay").alias("label"))

# Split the data
splits = data.randomSplit([0.7, 0.3])
train = splits[0]
test = splits[1].withColumnRenamed("label", "trueLabel")
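
Because `randomSplit` is called without a seed, each run produces a different 70/30 split and therefore slightly different results; passing the optional `seed` argument (for example `data.randomSplit([0.7, 0.3], seed=42)`) makes the split repeatable. The pure-Python sketch below illustrates the general idea of a weighted random split. It is not Spark's implementation, and `random_split` is a hypothetical helper written for this illustration.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to exactly one split by drawing a uniform number per row."""
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. [0.7, 1.0] for weights [0.7, 0.3]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point round-off

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if u < upper:
                splits[i].append(row)
                break
    return splits

rows = list(range(1000))  # stand-in for the rows of a DataFrame
train_rows, test_rows = random_split(rows, [0.7, 0.3], seed=42)
print("{0} training rows, {1} test rows".format(len(train_rows), len(test_rows)))
```

The split sizes come out close to 700 and 300 rather than exactly so, because each row is assigned independently. The weights describe proportions, not guaranteed counts.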

### Define the Pipeline
Now define a pipeline that creates a feature vector and trains a regression model.

In [5]:
# Define the pipeline
assembler = VectorAssembler(inputCols = ["DayofMonth", "DayOfWeek", "OriginAirportID", "DestAirportID", "DepDelay"], outputCol="features")
lr = LinearRegression(labelCol="label",featuresCol="features")
pipeline = Pipeline(stages=[assembler, lr])

### Tune Parameters
You can tune parameters to find the best model for your data. To do this, use the **CrossValidator** class to evaluate each combination of parameters in a grid built with **ParamGridBuilder**. Each combination is trained and scored on multiple *folds* of the data, where each fold splits the data into training and validation datasets, and the best-performing combination is kept. Note that this can take a long time to run because every parameter combination is fitted once per fold.

In [6]:
paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.3, 0.01]).addGrid(lr.maxIter, [10, 5]).build()
cv = CrossValidator(estimator=pipeline, evaluator=RegressionEvaluator(), estimatorParamMaps=paramGrid, numFolds=2)

model = cv.fit(train)
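
To see why the run time grows quickly, it can help to count the model fits that this grid implies. The sketch below is plain Python, not Spark code: it enumerates the same candidate values and walks a simple fold layout. The `fold_indices` helper and its modulo-based fold assignment are assumptions made for illustration; Spark assigns rows to folds randomly.

```python
from itertools import product

# Same candidate values as the ParamGridBuilder call above
grid = {"regParam": [0.3, 0.01], "maxIter": [10, 5]}
num_folds = 2

# Every combination of the candidate values
names = sorted(grid)
combos = [dict(zip(names, values))
          for values in product(*(grid[n] for n in names))]

def fold_indices(n_rows, k):
    """Yield (training, validation) index lists for k folds over n_rows rows."""
    for fold in range(k):
        validation = [i for i in range(n_rows) if i % k == fold]
        training = [i for i in range(n_rows) if i % k != fold]
        yield training, validation

fits = 0
for params in combos:
    for training, validation in fold_indices(10, num_folds):
        fits += 1  # one model fitted on `training` and scored on `validation`

total_fits = fits + 1  # plus one final refit of the best combination on all rows
print("{0} combinations x {1} folds + 1 refit = {2} fits".format(
    len(combos), num_folds, total_fits))
```

With two values for each of two parameters and two folds, that is nine fits in total. Adding a third value to either parameter, or a third fold, raises the count multiplicatively.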

### Test the Model
Now you're ready to apply the model to the test data.

In [7]:
prediction = model.transform(test)
predicted = prediction.select("features", "prediction", "trueLabel")
predicted.show()

+--------------------+-------------------+---------+
|            features|         prediction|trueLabel|
+--------------------+-------------------+---------+
|[1.0,1.0,10140.0,...|  -7.66882576082442|      -11|
|[1.0,1.0,10140.0,...|-5.6599242675222925|      -18|
|[1.0,1.0,10140.0,...|-5.6599242675222925|      -17|
|[1.0,1.0,10140.0,...| -3.651022774220165|      -12|
|[1.0,1.0,10140.0,...| -3.651022774220165|       -9|
|[1.0,1.0,10140.0,...| 0.2680248602125137|        4|
|[1.0,1.0,10140.0,...|  4.285827846816769|       -9|
|[1.0,1.0,10140.0,...|-6.8651465084467045|      -11|
|[1.0,1.0,10140.0,...| -4.856245015144577|      -11|
|[1.0,1.0,10140.0,...|  8.201614691319254|        5|
|[1.0,1.0,10140.0,...|  20.25502365113202|       14|
|[1.0,1.0,10140.0,...|  -5.86838190948824|       -5|
|[1.0,1.0,10140.0,...| 37.323000196507515|       38|
|[1.0,1.0,10140.0,...|-13.905385364095405|      -19|
|[1.0,1.0,10140.0,...|-13.905385364095405|      -15|
|[1.0,1.0,10140.0,...|  -9.88758237749115|      -25|

### Examine the Predicted and Actual Values
You can plot the predicted values against the actual values to see how accurately the model has predicted. With a perfect model, every predicted value would equal the actual value and the scatter plot would form a straight diagonal line; in practice, some variance is to be expected.
Run the cells below to create a temporary view from the **predicted** DataFrame and then retrieve the predicted and actual label values using SQL. You can then display the results as a scatter plot, specifying **-** as the function to show the unaggregated values.

In [8]:
predicted.createOrReplaceTempView("regressionPredictions")

In [9]:
spark.sql("SELECT trueLabel, prediction FROM regressionPredictions").show()

+---------+-------------------+
|trueLabel|         prediction|
+---------+-------------------+
|      -11|  -7.66882576082442|
|      -18|-5.6599242675222925|
|      -17|-5.6599242675222925|
|      -12| -3.651022774220165|
|       -9| -3.651022774220165|
|        4| 0.2680248602125137|
|       -9|  4.285827846816769|
|      -11|-6.8651465084467045|
|      -11| -4.856245015144577|
|        5|  8.201614691319254|
|       14|  20.25502365113202|
|       -5|  -5.86838190948824|
|       38| 37.323000196507515|
|      -19|-13.905385364095405|
|      -15|-13.905385364095405|
|      -25|  -9.88758237749115|
|       -6| -8.883131630840087|
|       -1| -7.878680884189022|
|       -2| -5.869779390886894|
|       -9| -4.865328644235831|
+---------+-------------------+
only showing top 20 rows



### Retrieve the Root Mean Square Error (RMSE)
There are a number of metrics used to measure the difference between predicted and actual values. Of these, the root mean square error (RMSE) is commonly used because it is measured in the same units as the predicted and actual values. In this case, the RMSE indicates the typical number of minutes by which a predicted flight delay differs from the actual delay; because errors are squared before averaging, large misses count for more than small ones. You can use the **RegressionEvaluator** class to retrieve the RMSE.


In [10]:
evaluator = RegressionEvaluator(labelCol="trueLabel", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(prediction)
print("Root Mean Square Error (RMSE): {0}".format(rmse))

Root Mean Square Error (RMSE): 13.1966381616
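
As a cross-check on what the evaluator reports, the RMSE formula can be written out in a few lines of plain Python. This is a minimal sketch, not the **RegressionEvaluator** implementation; the `rmse` function is a hypothetical helper, and the sample values are the first five rows of the output shown earlier.

```python
import math

def rmse(actual, predicted):
    """Square root of the mean squared difference between paired values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / float(len(squared_errors)))

# First five rows of the output shown earlier (trueLabel, prediction)
actual = [-11, -18, -17, -12, -9]
predicted = [-7.66882576082442, -5.6599242675222925, -5.6599242675222925,
             -3.651022774220165, -3.651022774220165]

sample_rmse = rmse(actual, predicted)
print("RMSE over the five sample rows: {0:.2f}".format(sample_rmse))
```

Five rows are not representative of the whole test set, so this value will not match the figure reported above. The point is only to show how the metric is built.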


### Identify the Optimal Parameters
Finally, retrieve the parameter values of the best model that the cross-validator found.

In [12]:
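# The last stage of the best pipeline is the fitted linear regression model
bestModel = model.bestModel.stages[-1]
# _java_obj is a private handle to the underlying JVM model, so this may break between Spark versions
print('Best Param (regParam): {0}'.format(bestModel._java_obj.getRegParam()))
print('Best Param (maxIter): {0}'.format(bestModel._java_obj.getMaxIter()))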

Best Param (regParam): 0.01
Best Param (maxIter): 10
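
Knowing only the winner hides how close the other combinations came. The sketch below pairs each parameter map with an averaged metric and sorts them, best first. `rank_param_maps` is a hypothetical helper, and the metric values are invented for illustration; they are not results from this notebook. With the objects defined above, a call would presumably look like `rank_param_maps(paramGrid, model.avgMetrics)`, assuming the installed PySpark version exposes `avgMetrics` on the fitted cross-validator model.

```python
def rank_param_maps(param_maps, avg_metrics, larger_is_better=False):
    """Pair each parameter map with its averaged metric, best first."""
    rows = []
    for param_map, metric in zip(param_maps, avg_metrics):
        # Spark Param objects carry a .name attribute; plain strings are used as-is
        readable = dict((getattr(k, "name", k), v) for k, v in param_map.items())
        rows.append((metric, readable))
    return sorted(rows, key=lambda row: row[0], reverse=larger_is_better)

# Hypothetical inputs, invented for illustration only
example_maps = [
    {"regParam": 0.3, "maxIter": 10},
    {"regParam": 0.3, "maxIter": 5},
    {"regParam": 0.01, "maxIter": 10},
    {"regParam": 0.01, "maxIter": 5},
]
example_metrics = [13.4, 13.5, 13.2, 13.3]

ranked = rank_param_maps(example_maps, example_metrics)
for metric, params in ranked:
    print("{0:.2f}  {1}".format(metric, params))
```

For RMSE a smaller value is better, which is why `larger_is_better` defaults to `False`; a metric such as R-squared would need it set to `True`.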
