# **Hyperparameter Tuning**

***
The following cell uses `findspark` to add PySpark to `sys.path` at runtime, which is needed when PySpark is not on the system path by default. It also prints the Spark installation path.
***

In [1]:
import findspark
print(findspark.find())
findspark.init() 

/opt/spark


***
# **Create a Spark Session**
***

In [2]:
import pyspark
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
#sc = SparkContext('local')
#spark = SparkSession(sc)

spark = SparkSession.builder.appName("Lab-07_Hyperparameter_Tuning").getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/08/21 10:09:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


***
Import the relevant classes from PySpark:
1. ParamGridBuilder is used to define the parameter grid for hyperparameter tuning.
2. TrainValidationSplit and CrossValidator are used to split the dataset into training and validation parts during tuning.
3. RegressionEvaluator is used to evaluate the regression model during hyperparameter tuning. <br>
    3. a. For a two-class classification model, use BinaryClassificationEvaluator. <br>
    3. b. For multiclass classification, use MulticlassClassificationEvaluator; for multilabel classification, use MultilabelClassificationEvaluator. <br>
    3. c. For a clustering model, use ClusteringEvaluator.
4. VectorAssembler is a transformer that merges multiple columns into a single vector column.
***

In [3]:
from pyspark.ml.tuning import ParamGridBuilder
from pyspark.ml.tuning import TrainValidationSplit
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator
from pyspark.ml.feature import VectorAssembler


***
The dataset `toyota.csv` contains the following columns: <br>

1. model: identifies the model of the vehicle <br>
2. year: denotes the registration year of the vehicle <br>
3. price: denotes the price of the vehicle in euros <br>
4. transmission: denotes the type of gearbox (e.g., manual or automatic) <br>
5. mileage: denotes the distance covered by the vehicle in miles <br>
6. fuelType: denotes the type of fuel used by the vehicle <br>
7. tax: denotes the amount of road tax paid for the vehicle <br>
8. mpg: denotes the number of miles covered by the vehicle per gallon of fuel <br>
9. engineSize: denotes the engine capacity in liters <br>
***


In [4]:
df = spark.read.option("header", True).csv("toyota.csv")

***
View the first 20 rows of the dataset.
***

In [5]:
df.show()

+-----+----+-----+------------+-------+--------+---+----+----------+
|model|year|price|transmission|mileage|fuelType|tax| mpg|engineSize|
+-----+----+-----+------------+-------+--------+---+----+----------+
| GT86|2016|16000|      Manual|  24089|  Petrol|265|36.2|       2.0|
| GT86|2017|15995|      Manual|  18615|  Petrol|145|36.2|       2.0|
| GT86|2015|13998|      Manual|  27469|  Petrol|265|36.2|       2.0|
| GT86|2017|18998|      Manual|  14736|  Petrol|150|36.2|       2.0|
| GT86|2017|17498|      Manual|  36284|  Petrol|145|36.2|       2.0|
| GT86|2017|15998|      Manual|  26919|  Petrol|260|36.2|       2.0|
| GT86|2017|18522|      Manual|  10456|  Petrol|145|36.2|       2.0|
| GT86|2017|18995|      Manual|  12340|  Petrol|145|36.2|       2.0|
| GT86|2020|27998|      Manual|    516|  Petrol|150|33.2|       2.0|
| GT86|2016|13990|      Manual|  37999|  Petrol|265|36.2|       2.0|
| GT86|2013|10495|      Manual|  72000|  Petrol|265|36.2|       2.0|
| GT86|2017|17990|      Manual|  1

***
View the schema of the dataframe
***

In [6]:
df.printSchema()

root
 |-- model: string (nullable = true)
 |-- year: string (nullable = true)
 |-- price: string (nullable = true)
 |-- transmission: string (nullable = true)
 |-- mileage: string (nullable = true)
 |-- fuelType: string (nullable = true)
 |-- tax: string (nullable = true)
 |-- mpg: string (nullable = true)
 |-- engineSize: string (nullable = true)



***
Count the unique values in each column of the dataframe. To view the unique values themselves, replace `count()` with `show()` in the cell below.
***

In [7]:
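print("Unique values in each column are \n")
# Use col_name rather than col so the loop variable cannot overwrite the
# pyspark.sql.functions.col import used in later cells if this cell is re-run.
for col_name in df.columns:
    print(col_name, df.select(col_name).distinct().count())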

Unique values in each column are 

model 18
year 23
price 2114
transmission 4
mileage 5699
fuelType 4
tax 29
mpg 81
engineSize 16


***
Check for any NaN values in the dataframe.
***

In [8]:
print('\nCheck for Missing values')
from pyspark.sql.functions import isnan, when, count, col

df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()


Check for Missing values
+-----+----+-----+------------+-------+--------+---+---+----------+
|model|year|price|transmission|mileage|fuelType|tax|mpg|engineSize|
+-----+----+-----+------------+-------+--------+---+---+----------+
|    0|   0|    0|           0|      0|       0|  0|  0|         0|
+-----+----+-----+------------+-------+--------+---+---+----------+



***
Check for any null values in the dataframe.
***

In [9]:
print('\nCheck for Null values')

from pyspark.sql.functions import isnull, when, count, col

df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()



Check for Null values
+-----+----+-----+------------+-------+--------+---+---+----------+
|model|year|price|transmission|mileage|fuelType|tax|mpg|engineSize|
+-----+----+-----+------------+-------+--------+---+---+----------+
|    0|   0|    0|           0|      0|       0|  0|  0|         0|
+-----+----+-----+------------+-------+--------+---+---+----------+



***
Encode the categorical columns in the dataframe as numerical values using label indexing. By default, StringIndexer assigns indices in order of descending label frequency, starting from 0.0. For example, if 'A' occurs more often than 'B', then 'A' is encoded as 0.0 and 'B' as 1.0. A new column is added to the dataframe for each encoded column.
***
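
***
As a rough illustration of how label indexing assigns its indices, the following pure-Python sketch mimics a frequency-descending ordering. It does not call Spark: `index_labels` is a hypothetical helper, the sample values are made up rather than taken from `toyota.csv`, and breaking ties alphabetically is an assumption of this sketch.
***

```python
from collections import Counter


def index_labels(values):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically (assumed rule).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical sample column, not taken from toyota.csv.
sample = ["Manual", "Automatic", "Manual", "Semi-Auto", "Manual", "Automatic"]
mapping = index_labels(sample)
print(mapping)
```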

In [10]:
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline

indexer = [
    StringIndexer(inputCol='model', outputCol="m_index"),
    StringIndexer(inputCol='transmission', outputCol="t_index"),
    StringIndexer(inputCol='fuelType', outputCol="ft_index"),
]
        
pipeline = Pipeline(stages=indexer)

df_indexed = pipeline.fit(df).transform(df)
df_indexed.show()

+-----+----+-----+------------+-------+--------+---+----+----------+-------+-------+--------+
|model|year|price|transmission|mileage|fuelType|tax| mpg|engineSize|m_index|t_index|ft_index|
+-----+----+-----+------------+-------+--------+---+----+----------+-------+-------+--------+
| GT86|2016|16000|      Manual|  24089|  Petrol|265|36.2|       2.0|   10.0|    0.0|     0.0|
| GT86|2017|15995|      Manual|  18615|  Petrol|145|36.2|       2.0|   10.0|    0.0|     0.0|
| GT86|2015|13998|      Manual|  27469|  Petrol|265|36.2|       2.0|   10.0|    0.0|     0.0|
| GT86|2017|18998|      Manual|  14736|  Petrol|150|36.2|       2.0|   10.0|    0.0|     0.0|
| GT86|2017|17498|      Manual|  36284|  Petrol|145|36.2|       2.0|   10.0|    0.0|     0.0|
| GT86|2017|15998|      Manual|  26919|  Petrol|260|36.2|       2.0|   10.0|    0.0|     0.0|
| GT86|2017|18522|      Manual|  10456|  Petrol|145|36.2|       2.0|   10.0|    0.0|     0.0|
| GT86|2017|18995|      Manual|  12340|  Petrol|145|36.2|   

***
Drop the original categorical columns, keeping only their indexed versions.
***

In [11]:
df_indexed = df_indexed.drop('model', 'transmission', 'fuelType')
df = df_indexed
df_indexed.show()

+----+-----+-------+---+----+----------+-------+-------+--------+
|year|price|mileage|tax| mpg|engineSize|m_index|t_index|ft_index|
+----+-----+-------+---+----+----------+-------+-------+--------+
|2016|16000|  24089|265|36.2|       2.0|   10.0|    0.0|     0.0|
|2017|15995|  18615|145|36.2|       2.0|   10.0|    0.0|     0.0|
|2015|13998|  27469|265|36.2|       2.0|   10.0|    0.0|     0.0|
|2017|18998|  14736|150|36.2|       2.0|   10.0|    0.0|     0.0|
|2017|17498|  36284|145|36.2|       2.0|   10.0|    0.0|     0.0|
|2017|15998|  26919|260|36.2|       2.0|   10.0|    0.0|     0.0|
|2017|18522|  10456|145|36.2|       2.0|   10.0|    0.0|     0.0|
|2017|18995|  12340|145|36.2|       2.0|   10.0|    0.0|     0.0|
|2020|27998|    516|150|33.2|       2.0|   10.0|    0.0|     0.0|
|2016|13990|  37999|265|36.2|       2.0|   10.0|    0.0|     0.0|
|2013|10495|  72000|265|36.2|       2.0|   10.0|    0.0|     0.0|
|2017|17990|  12597|145|36.2|       2.0|   10.0|    0.0|     0.0|
|2017|1699

***
Cast all the columns to the 'float' type, whatever their current datatype.
***

In [12]:
for col_name in df_indexed.columns:
    df_indexed = df_indexed.withColumn(col_name, col(col_name).cast('float'))

df_indexed.printSchema()

root
 |-- year: float (nullable = true)
 |-- price: float (nullable = true)
 |-- mileage: float (nullable = true)
 |-- tax: float (nullable = true)
 |-- mpg: float (nullable = true)
 |-- engineSize: float (nullable = true)
 |-- m_index: float (nullable = false)
 |-- t_index: float (nullable = false)
 |-- ft_index: float (nullable = false)



In [13]:
df_indexed.show()

+------+-------+-------+-----+----+----------+-------+-------+--------+
|  year|  price|mileage|  tax| mpg|engineSize|m_index|t_index|ft_index|
+------+-------+-------+-----+----+----------+-------+-------+--------+
|2016.0|16000.0|24089.0|265.0|36.2|       2.0|   10.0|    0.0|     0.0|
|2017.0|15995.0|18615.0|145.0|36.2|       2.0|   10.0|    0.0|     0.0|
|2015.0|13998.0|27469.0|265.0|36.2|       2.0|   10.0|    0.0|     0.0|
|2017.0|18998.0|14736.0|150.0|36.2|       2.0|   10.0|    0.0|     0.0|
|2017.0|17498.0|36284.0|145.0|36.2|       2.0|   10.0|    0.0|     0.0|
|2017.0|15998.0|26919.0|260.0|36.2|       2.0|   10.0|    0.0|     0.0|
|2017.0|18522.0|10456.0|145.0|36.2|       2.0|   10.0|    0.0|     0.0|
|2017.0|18995.0|12340.0|145.0|36.2|       2.0|   10.0|    0.0|     0.0|
|2020.0|27998.0|  516.0|150.0|33.2|       2.0|   10.0|    0.0|     0.0|
|2016.0|13990.0|37999.0|265.0|36.2|       2.0|   10.0|    0.0|     0.0|
|2013.0|10495.0|72000.0|265.0|36.2|       2.0|   10.0|    0.0|  

***
1. Determine whether duplicate rows are present. <br>
2. Drop the duplicated rows from the dataframe.
***

In [14]:
print("Size of Data before dropping duplicates is ", df.count(), len(df.columns))

print("Number of distinct rows is ", df.distinct().count())
df = df.dropDuplicates()
df_indexed = df_indexed.dropDuplicates()

print("Size of Data after dropping duplicates is ", df.count(), len(df.columns))

Size of Data before dropping duplicates is  6738 9
Number of distinct rows is  6699
Size of Data after dropping duplicates is  6699 9


***
Determine the statistical attributes of the columns in the dataframe.
***

In [15]:
df.summary().show()

+-------+------------------+-----------------+------------------+-----------------+------------------+------------------+-----------------+------------------+-------------------+
|summary|              year|            price|           mileage|              tax|               mpg|        engineSize|          m_index|           t_index|           ft_index|
+-------+------------------+-----------------+------------------+-----------------+------------------+------------------+-----------------+------------------+-------------------+
|  count|              6699|             6699|              6699|             6699|              6699|              6699|             6699|              6699|               6699|
|   mean|2016.7427974324526|12529.79907448873|22889.588744588746| 94.5499328257949|63.078728168383925| 1.471995820271686|2.070607553366174|0.4720107478728168|0.49962680997163755|
| stddev|2.2052707016915356|6358.562625250668|19109.288501160918|73.94264929532783| 15.86103713724215|0.4

                                                                                

***
Determine the Pearson correlation matrix for the columns of the Dataframe.
***

In [16]:
from pyspark.ml.stat import Correlation
from pyspark.ml.feature import VectorAssembler

# convert to vector column first
vector_col = "corr_features"
assembler = VectorAssembler(inputCols=df_indexed.columns, outputCol=vector_col)
df_vector = assembler.transform(df_indexed).select(vector_col)

matrix = Correlation.corr(df_vector, vector_col).head()

22/08/21 10:09:37 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
22/08/21 10:09:37 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.ForeignLinkerBLAS


***
Display the row of the Pearson correlation matrix that corresponds to 'price' (i.e., the label column).
***

In [17]:
print("Pearson correlation matrix:\n" + str(matrix[0].toArray()[1]))

Pearson correlation matrix:
[ 0.42281277  1.         -0.30059762  0.21540119 -0.03968017  0.72879115
  0.59063058  0.48990246  0.44138194]


***
Select the three features most strongly correlated with the label column (i.e., 'price'), and drop the remaining feature columns.
***
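
***
The selection can be checked by ranking the printed correlation row by absolute value. The sketch below copies the values from the output above and assumes they follow the column order reported by `printSchema()`. The variable names are illustrative only.
***

```python
# Column order as reported by df_indexed.printSchema().
columns = ["year", "price", "mileage", "tax", "mpg",
           "engineSize", "m_index", "t_index", "ft_index"]
# Correlation of each column with 'price', copied from the printed row.
price_row = [0.42281277, 1.0, -0.30059762, 0.21540119, -0.03968017,
             0.72879115, 0.59063058, 0.48990246, 0.44138194]

label = "price"
# Rank every non-label column by the magnitude of its correlation with the label.
ranked = sorted(
    ((name, corr) for name, corr in zip(columns, price_row) if name != label),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)
top_features = [name for name, _ in ranked[:3]]
to_drop = [name for name, _ in ranked[3:]]
print("keep:", top_features)
print("drop:", to_drop)
```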

In [18]:
df_input = df_indexed.drop('year','mileage', 'tax', 'mpg', 'ft_index') 
df_input.show(5)

+-------+----------+-------+-------+
|  price|engineSize|m_index|t_index|
+-------+----------+-------+-------+
|15790.0|       1.2|    5.0|    0.0|
|27570.0|       2.5|    4.0|    1.0|
|18995.0|       2.5|    4.0|    1.0|
|10311.0|       2.2|    4.0|    0.0|
|14795.0|       2.0|    4.0|    0.0|
+-------+----------+-------+-------+
only showing top 5 rows



***
1. Create a vector of the selected features using VectorAssembler.
2. Apply Min-Max normalization to the assembled feature vector.
***

In [19]:
from pyspark.ml.feature import MinMaxScaler

assembler = VectorAssembler(inputCols=['engineSize','m_index','t_index'], outputCol='features')
df_input = assembler.transform(df_input)

scaler = MinMaxScaler(inputCol="features", outputCol="scaled_features")

df_input = scaler.fit(df_input).transform(df_input)
df_input.show(5)

+-------+----------+-------+-------+--------------------+--------------------+
|  price|engineSize|m_index|t_index|            features|     scaled_features|
+-------+----------+-------+-------+--------------------+--------------------+
|15790.0|       1.2|    5.0|    0.0|[1.20000004768371...|[0.26666667726304...|
|27570.0|       2.5|    4.0|    1.0|       [2.5,4.0,1.0]|[0.55555555555555...|
|18995.0|       2.5|    4.0|    1.0|       [2.5,4.0,1.0]|[0.55555555555555...|
|10311.0|       2.2|    4.0|    0.0|[2.20000004768371...|[0.48888889948527...|
|14795.0|       2.0|    4.0|    0.0|       [2.0,4.0,0.0]|[0.44444444444444...|
+-------+----------+-------+-------+--------------------+--------------------+
only showing top 5 rows
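
***
Min-Max normalization rescales each feature to the range [0, 1] as (x - min) / (max - min). The sketch below is a plain-Python version of that formula, not Spark's implementation. `min_max_scale` is a hypothetical helper, and the engineSize range of 0.0 to 4.5 is an assumption, chosen because it reproduces the scaled values shown in the output above.
***

```python
def min_max_scale(x, col_min, col_max, new_min=0.0, new_max=1.0):
    """Rescale x from [col_min, col_max] to [new_min, new_max]."""
    if col_max == col_min:
        # A constant column has no spread; map it to the middle of the target range.
        return 0.5 * (new_max + new_min)
    return (x - col_min) / (col_max - col_min) * (new_max - new_min) + new_min


# Assumed engineSize range (not stated in the notebook; inferred from its output).
engine_min, engine_max = 0.0, 4.5
scaled = [min_max_scale(x, engine_min, engine_max) for x in (1.2, 2.0, 2.5)]
print(scaled)
```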



***
Apply Min-Max normalization to the 'price' column. MinMaxScaler operates on vector columns, so 'price' is first wrapped in a one-element vector.
***

In [20]:
assembler = VectorAssembler(inputCols=['price'], outputCol='label')
df_input = assembler.transform(df_input)

scaler = MinMaxScaler(inputCol="label", outputCol="scaled_price")
df_input = scaler.fit(df_input).transform(df_input)
df_input.show(5)

+-------+----------+-------+-------+--------------------+--------------------+---------+--------------------+
|  price|engineSize|m_index|t_index|            features|     scaled_features|    label|        scaled_price|
+-------+----------+-------+-------+--------------------+--------------------+---------+--------------------+
|15790.0|       1.2|    5.0|    0.0|[1.20000004768371...|[0.26666667726304...|[15790.0]|[0.2525995434948009]|
|27570.0|       2.5|    4.0|    1.0|       [2.5,4.0,1.0]|[0.55555555555555...|[27570.0]|[0.4517710710964578]|
|18995.0|       2.5|    4.0|    1.0|       [2.5,4.0,1.0]|[0.55555555555555...|[18995.0]|[0.30678840138642...|
|10311.0|       2.2|    4.0|    0.0|[2.20000004768371...|[0.48888889948527...|[10311.0]|[0.15996280328007...|
|14795.0|       2.0|    4.0|    0.0|       [2.0,4.0,0.0]|[0.44444444444444...|[14795.0]|[0.23577648152844...|
+-------+----------+-------+-------+--------------------+--------------------+---------+--------------------+
only showi

***
The label column ('price') must be of type 'float' or 'double' before it can be passed to the regression model. <br>
Min-Max normalization produces its output as a vector, <br>
so we convert the scaled label column to the 'float' data type.
***

In [21]:
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType
#from pyspark.ml.functions import array_to_vector

df_data = df_input.select('scaled_features','scaled_price')
df_data.printSchema()

# MinMaxScaler outputs a one-element vector; extract its single value as a float.
udf1 = F.udf(lambda x: float(x[0]), FloatType())
df_data = df_input.withColumn('scaled_price', udf1('scaled_price')).select('scaled_features', 'scaled_price')
df_data.show(5)
df_data.printSchema()

root
 |-- scaled_features: vector (nullable = true)
 |-- scaled_price: vector (nullable = true)



+--------------------+------------+
|     scaled_features|scaled_price|
+--------------------+------------+
|[0.26666667726304...|  0.25259954|
|[0.55555555555555...|  0.45177108|
|[0.55555555555555...|   0.3067884|
|[0.48888889948527...|   0.1599628|
|[0.44444444444444...|  0.23577648|
+--------------------+------------+
only showing top 5 rows

root
 |-- scaled_features: vector (nullable = true)
 |-- scaled_price: float (nullable = true)



                                                                                

***
We split the dataset into training and testing parts in a 70-30 ratio.
***

In [22]:
train, test = df_data.randomSplit([0.7, 0.3], seed=1234)
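
***
`randomSplit` assigns rows to the splits at random, so the 70-30 ratio holds only approximately, and the seed makes the assignment repeatable. The following sketch is a simplified pure-Python illustration of that idea and is not Spark's actual sampler. `random_split` is a hypothetical helper.
***

```python
import random


def random_split(rows, weights, seed):
    """Randomly assign each row to one split, with probability proportional to its weight."""
    total = sum(weights)
    # Cumulative upper bounds of each split on the interval [0, 1).
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    rng = random.Random(seed)
    parts = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding at the top end.
            if u < bound or i == len(bounds) - 1:
                parts[i].append(row)
                break
    return parts


train_part, test_part = random_split(range(1000), [0.7, 0.3], seed=1234)
print(len(train_part), len(test_part))
```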

In [23]:
train.show(n = 5, truncate=False)

+------------------------------+------------+
|scaled_features               |scaled_price|
+------------------------------+------------+
|[0.0,0.058823529411764705,0.0]|0.12088934  |
|[0.0,0.058823529411764705,0.0]|0.12088934  |
|[0.0,0.1764705882352941,0.0]  |0.22740722  |
|[0.2222222222222222,0.0,0.0]  |0.00211345  |
|[0.2222222222222222,0.0,0.0]  |0.01183532  |
+------------------------------+------------+
only showing top 5 rows




# Hyperparameter Tuning

***
1. Configure a Linear Regression Model by specifying the input column, output column, and the maximum number of iterations. <br>
2. Define the Parameter Grid for the Linear Regression Model.
***

In [24]:
from pyspark.ml.regression import LinearRegression
lr = LinearRegression(featuresCol="scaled_features", labelCol='scaled_price', maxIter=100)

paramGrid = ParamGridBuilder() \
    .addGrid(lr.elasticNetParam, [0.2, 0.8]) \
    .addGrid(lr.regParam, [0.3]) \
    .build()
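
***
A parameter grid is the Cartesian product of the candidate values for each parameter, so the grid above yields 2 x 1 = 2 combinations. The sketch below reproduces that expansion in plain Python. `expand_grid` is a hypothetical helper, and plain strings stand in for the `Param` objects that `ParamGridBuilder` uses as keys.
***

```python
from itertools import product


def expand_grid(grid):
    """Return one dict per combination of the candidate values in grid."""
    names = list(grid)
    combos = product(*(grid[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


param_maps = expand_grid({"elasticNetParam": [0.2, 0.8], "regParam": [0.3]})
for param_map in param_maps:
    print(param_map)
```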

***
Use CrossValidator to tune the hyperparameters of the Linear Regression model over the defined parameter grid.
***

In [25]:
crossval = CrossValidator(estimator=lr,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator(labelCol="scaled_price", predictionCol="prediction", metricName="rmse"),
                          numFolds=2,
                          parallelism=2)
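
***
With `numFolds=2` and two parameter combinations, cross-validation fits 2 x 2 = 4 models, and the best combination is then refit on the full training set. The sketch below illustrates the fold bookkeeping in plain Python. `k_fold_indices` is a hypothetical helper, and its shuffle-and-slice fold assignment is a simplification, not how Spark assigns rows to folds.
***

```python
import random


def k_fold_indices(n_rows, num_folds, seed=0):
    """Return (training, validation) index lists, one pair per fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into num_folds roughly equal folds.
    folds = [indices[i::num_folds] for i in range(num_folds)]
    splits = []
    for i in range(num_folds):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, validation))
    return splits


splits = k_fold_indices(n_rows=10, num_folds=2)
num_param_maps = 2  # size of the parameter grid defined above
# One fit per (fold, parameter combination), plus one final refit of the best combination.
total_fits = len(splits) * num_param_maps + 1
print(total_fits)
```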

***
Train the Linear Regression model with the CrossValidator.
***

In [26]:
cvModel = crossval.fit(train)

                                                                                

***
Obtain the best model
***

In [27]:
print(cvModel.bestModel)

LinearRegressionModel: uid=LinearRegression_597882899058, numFeatures=3


***
Display the details of all the models trained 
***

In [28]:
list(zip(cvModel.avgMetrics, cvModel.getEstimatorParamMaps()))

[(0.10215530060177624,
  {Param(parent='LinearRegression_597882899058', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.2,
   Param(parent='LinearRegression_597882899058', name='regParam', doc='regularization parameter (>= 0).'): 0.3}),
 (0.10571605653888813,
  {Param(parent='LinearRegression_597882899058', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.8,
   Param(parent='LinearRegression_597882899058', name='regParam', doc='regularization parameter (>= 0).'): 0.3})]
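
***
Because the evaluator metric is RMSE, the best parameter combination is the one with the smallest average metric. The sketch below picks it from the values printed above, with plain strings standing in for the `Param` objects.
***

```python
# (average RMSE across folds, parameter combination), copied from the output above.
results = [
    (0.10215530060177624, {"elasticNetParam": 0.2, "regParam": 0.3}),
    (0.10571605653888813, {"elasticNetParam": 0.8, "regParam": 0.3}),
]

# Lower RMSE is better, so take the minimum rather than the maximum.
best_metric, best_params = min(results, key=lambda pair: pair[0])
print(best_metric, best_params)
```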

***
Determine the root mean squared error (RMSE) of the best model on the test data.
***

In [29]:
e_summary = cvModel.bestModel.evaluate(test)

In [30]:
print(e_summary.rootMeanSquaredError)

0.1046462215018284
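
***
This RMSE is measured on the Min-Max scaled price, so it is not directly in euros. Assuming the label was scaled with a plain (x - min) / (max - min) map as in the cells above, the price range can be recovered from any two (price, scaled_price) pairs in the earlier output and used to rescale the error. `recover_range` is a hypothetical helper.
***

```python
def recover_range(price_a, scaled_a, price_b, scaled_b):
    """Recover (min, max - min) of a min-max scaled column from two known pairs."""
    price_range = (price_b - price_a) / (scaled_b - scaled_a)
    price_min = price_a - scaled_a * price_range
    return price_min, price_range


# Two (price, scaled_price) pairs copied from the df_input.show(5) output.
price_min, price_range = recover_range(
    15790.0, 0.2525995434948009,
    27570.0, 0.4517710710964578,
)

rmse_scaled = 0.1046462215018284  # test RMSE printed above
# A difference in scaled units times the price range gives a difference in euros.
rmse_euros = rmse_scaled * price_range
print(round(price_min), round(price_range), round(rmse_euros))
```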


***
Determine the predictions of the best model on the test data
***

In [31]:
predict = cvModel.bestModel.transform(test)

In [32]:
predict.show(10)

+--------------------+------------+-------------------+
|     scaled_features|scaled_price|         prediction|
+--------------------+------------+-------------------+
|           (3,[],[])|  0.19359203| 0.1786395857079632|
|[0.0,0.0,0.333333...|  0.17161214| 0.1786395857079632|
|[0.0,0.0588235294...|  0.15132302|0.17867289303133807|
|[0.22222222222222...|  0.07515428| 0.1914056336096327|
|[0.22222222222222...|  0.08196805| 0.1914056336096327|
|[0.22222222222222...|  0.08536647| 0.1914056336096327|
|[0.22222222222222...| 0.086735986| 0.1914056336096327|
|[0.22222222222222...|  0.08690506| 0.1914056336096327|
|[0.22222222222222...|  0.08842675| 0.1914056336096327|
|[0.22222222222222...|  0.09197734| 0.1914056336096327|
+--------------------+------------+-------------------+
only showing top 10 rows



## Repeat the hyperparameter tuning process for the Random Forest Regressor

***
1. Configure the Random Forest Regressor <br>
2. Define Parameter Grid <br>
3. Use the Cross Validator for hyperparameter tuning <br>
4. Determine the best model <br>
5. Display the details of all the models trained <br>
6. Evaluate the best model on the test data and display the RMSE <br>
***

In [33]:
from pyspark.ml.regression import RandomForestRegressor

rf = RandomForestRegressor(featuresCol="scaled_features", labelCol='scaled_price')

In [34]:
paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [1, 10]) \
    .addGrid(rf.maxDepth, [5]) \
    .build()

In [35]:
crossval = CrossValidator(estimator=rf,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator(labelCol="scaled_price", predictionCol="prediction", metricName="rmse"),
                          numFolds=2,
                          parallelism=2)

In [36]:
cvModel = crossval.fit(train)

                                                                                

In [37]:
print(cvModel.bestModel)

RandomForestRegressionModel: uid=RandomForestRegressor_2deb64e860a4, numTrees=1, numFeatures=3


In [38]:
list(zip(cvModel.avgMetrics, cvModel.getEstimatorParamMaps()))

[(0.04971509428578412,
  {Param(parent='RandomForestRegressor_2deb64e860a4', name='numTrees', doc='Number of trees to train (>= 1).'): 1,
   Param(parent='RandomForestRegressor_2deb64e860a4', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 5}),
 (0.05028101498389054,
  {Param(parent='RandomForestRegressor_2deb64e860a4', name='numTrees', doc='Number of trees to train (>= 1).'): 10,
   Param(parent='RandomForestRegressor_2deb64e860a4', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 5})]

In [39]:
predict = cvModel.bestModel.transform(test)

In [40]:
predict.show(10)

+--------------------+------------+-------------------+
|     scaled_features|scaled_price|         prediction|
+--------------------+------------+-------------------+
|           (3,[],[])|  0.19359203|0.11404406423432714|
|[0.0,0.0,0.333333...|  0.17161214|0.14120883116789903|
|[0.0,0.0588235294...|  0.15132302|0.11404406423432714|
|[0.22222222222222...|  0.07515428|0.11404406423432714|
|[0.22222222222222...|  0.08196805|0.11404406423432714|
|[0.22222222222222...|  0.08536647|0.11404406423432714|
|[0.22222222222222...| 0.086735986|0.11404406423432714|
|[0.22222222222222...|  0.08690506|0.11404406423432714|
|[0.22222222222222...|  0.08842675|0.11404406423432714|
|[0.22222222222222...|  0.09197734|0.11404406423432714|
+--------------------+------------+-------------------+
only showing top 10 rows



In [41]:
evaluator = RegressionEvaluator(labelCol="scaled_price", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predict)

In [42]:
print(rmse)

0.0526963909254728
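
***
For reference, RMSE is the square root of the mean squared difference between labels and predictions. The sketch below computes it in plain Python on made-up numbers. `rmse` and `mse` are hypothetical helpers, not part of PySpark.
***

```python
import math


def mse(labels, predictions):
    """Mean squared error between two equally long, non-empty sequences."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and of equal length")
    return sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels)


def rmse(labels, predictions):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(labels, predictions))


# Made-up labels and predictions, for illustration only.
example_labels = [0.0, 0.0]
example_predictions = [3.0, 4.0]
print(mse(example_labels, example_predictions), rmse(example_labels, example_predictions))
```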


In [43]:
spark.stop()