##### Grading Feedback

# IST 718: Big Data Analytics

- Professor: Willard Williamson <wewillia@syr.edu>
- Faculty Assistant: Vidushi Mishra <vmishr01@syr.edu>
- Faculty Assistant: Pranav Kottoli Radhakrishna <pkottoli@syr.edu>
## General instructions:

- You are welcome to discuss the problems with your classmates but __you are not allowed to copy any part of your answers from your classmates.  Short code snippets are allowed from the internet.  Code from the class text books or class provided code can be copied in its entirety.__
- Do not modify cells marked as grading cells or marked as do not modify.
- Before submitting your work, remember to check for runtime errors with the following procedure:
Runtime $\rightarrow$ Factory reset runtime, followed by Runtime $\rightarrow$ Run All.  All runtime errors will result in a minimum penalty of half off.
- Google Colab is the official class runtime environment so you should test your code on Colab before submission.
- All plots shall include descriptive title and axis labels.  Plot legends shall be included where possible.  Unless stated otherwise, plots can be made using any Python plotting package.  It is understood that spark data structures must be converted to something like numpy or pandas prior to making plots.  All required mathematical operations, filtering, selection, etc., required by a homework question shall be performed in spark prior to converting to numpy or pandas.
- Grading feedback cells are there for graders to provide feedback to students.  Don't change or remove grading feedback cells.
- Don't add or remove files from your git repo.
- Do not change file names in your repo.  This also means don't change the title of the ipython notebook.
- You are free to add additional code cells around the cells marked `your code here`.
- We reserve the right to take points off for operations that are extremely inefficient or "heavy weight".  This is a big data class and extremely inefficient operations make a big difference when scaling up to large data sets.  For example, the spark dataframe collect() method is a very heavy weight operation and should not be used unless there is a real need for it.  An example where collect() might be needed is to get ready to make a plot after filtering a spark dataframe.
- import * is not allowed because it is considered a very bad coding practice and in some cases can result in a significant delay (which slows down the grading process) in loading imports.  For example, the statement `from sympy import *` is not allowed.  You must import the specific packages that you need. 
- The graders reserve the right to deduct points for subjective things we see with your code.  For example, if we ask you to create a pandas data frame to display values from an investigation and you hard code the values, we will take points off for that.  This is only one of many different things we could find in reviewing your code.  In general, write your code like you are submitting it for a code peer review in industry.  
- Level of effort is part of our subjective grading.  For example, in cases where we ask for a more open ended investigation, some students put in significant effort and some students do the minimum possible to meet requirements.  In these cases, we may take points off for students who did not put in much effort as compared to students who put in a lot of effort.  We feel that the students who did a better job deserve a better grade.  We reserve the right to invoke level of effort grading at any time.
- Only use spark, spark machine learning, spark data frames, RDD's, and map reduce to solve all problems unless instructed otherwise.
- Your notebook must run from start to finish without requiring manual input by the graders.  For example, do not mount your personal Google drive in your notebook as this will require graders to perform manual steps.  In short, your notebook should run from start to finish with no runtime errors and no need for graders to perform any manual steps.
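
The `import *` rule in the list above can be illustrated with the standard library alone. The sketch below is only an illustration, not part of the assignment: it star-imports `math` into a scratch namespace (the `namespace` dict is a hypothetical stand-in for a notebook's globals) and shows that the name `pow` then refers to `math.pow` rather than the built-in. A star import of `pyspark.sql.functions` can replace built-ins such as `sum`, `min`, and `max` in the same way.

```python
import builtins
import math

# Scratch namespace standing in for a notebook's global scope (hypothetical)
namespace = {}
exec("from math import *", namespace)

# After the star import, the name `pow` points at math.pow, not the built-in
shadowed = namespace["pow"] is math.pow
star_result = namespace["pow"](2, 3)    # math.pow always returns a float
builtin_result = builtins.pow(2, 3)     # the built-in keeps integer arithmetic

print(shadowed, star_result, builtin_result)
```

Importing only the names that are needed keeps it obvious which function each call refers to.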


## Note that this notebook is expected to run in the Google Colab environment.  All grading for this assignment will take place exclusively in Google Colab.

This homework proves that diamonds are forever.  In homework 3, we used linear regression to predict diamond prices and evaluated model performance using MSE as the scoring metric.  In this homework, we are going to use the same diamonds data set but this time use decision trees and deep learning to see if we can improve upon the linear regression performance from homework 3.

# Diamonds Data
Just to prove that diamonds are forever, we are going to revisit the diamonds data set.  This homework assignment will use the diamonds dataset to explore random forest decision tree models.

The diamonds.csv data set contains 10 columns:
- carat: Carat weight of the diamond
- cut: Describes cut quality of the diamond. Quality in increasing order Fair, Good, Very Good, Premium, Ideal
- color: Color of the diamond, with D being the best and J the worst
- clarity: How obvious inclusions are within the diamond:(in order from best to worst, FL = flawless, I3= level 3 inclusions) FL,IF, VVS1, etc.  See this web site for an exhaustive ranking of [clarity](https://4cs.gia.edu/en-us/diamond-clarity/?gclid=Cj0KCQjwnqH7BRDdARIsACTSAduMoc2KQbXkO94BxCfBNC5X8YyjAYcFpWThKQMW46cQj_3p0pZ0o84aAuagEALw_wcB).  The web site has a nice sliding scale you can drag to see the relationship between clarity grades.
- depth: depth % - The height of a diamond, measured from the culet to the table, divided by its average girdle diameter
- table: table% -  The width of the diamond's table expressed as a percentage of its average diameter
- price: The price of the diamond
- x: Length (mm)
- y: Width (mm)
- z: Height (mm)
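
The depth definition above can be checked against the x, y, and z columns. The following is a minimal sketch, assuming that "average girdle diameter" means the mean of length (x) and width (y); the function name `depth_percent` is made up for illustration. The input values are taken from the first row of the data set shown later in this notebook.

```python
def depth_percent(x_mm: float, y_mm: float, z_mm: float) -> float:
    """Height as a percentage of the mean of length and width."""
    mean_diameter = (x_mm + y_mm) / 2.0
    return 100.0 * z_mm / mean_diameter


# First row of the data: x=3.95, y=3.98, z=2.43, recorded depth 61.5
first_row_depth = depth_percent(3.95, 3.98, 2.43)
print(round(first_row_depth, 2))
```

The recomputed value agrees only approximately with the recorded depth for that row, so the depth column should not be treated as an exact function of x, y, and z.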

In [1]:
# Grading Cell
enable_grid_search = False

The following cell is used to read the diamonds data set into the colab environment.  Do not change or modify the following cell.

In [2]:
%%bash
# Do not change or modify this file
# Need to install pyspark
# if pyspark is already installed, will print a message indicating pyspark already installed
pip install pyspark

# Download the data files from github
# If the data file does not exist in the colab environment
if [[ ! -f ./diamonds.csv ]]; then
   # download the data file from github and save it in this colab environment instance
   wget https://raw.githubusercontent.com/wewilli1/ist718_data/master/diamonds.csv  
fi

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/f0/26/198fc8c0b98580f617cb03cb298c6056587b8f0447e20fa40c5b634ced77/pyspark-3.0.1.tar.gz (204.2MB)
Collecting py4j==0.10.9
  Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py): started
  Building wheel for pyspark (setup.py): finished with status 'done'
  Created wheel for pyspark: filename=pyspark-3.0.1-py2.py3-none-any.whl size=204612243 sha256=08e2a39e9b9a6d38da88ce0dc1412a18ba93662a855ef35b54b5000ff995fe2a
  Stored in directory: /root/.cache/pip/wheels/5e/bd/07/031766ca628adec8435bb40f0bd83bb676ce65ff4007f8e73f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.0.1


--2020-11-23 01:39:59--  https://raw.githubusercontent.com/wewilli1/ist718_data/master/diamonds.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3192560 (3.0M) [text/plain]
Saving to: ‘diamonds.csv’


In [3]:
import os
from pyspark.sql import SparkSession

# Thanks to Brian Schramke for the following spark session code
MAX_MEMORY = "12g"

spark = SparkSession \
  .builder \
  .master("local[*]")\
  .config("spark.memory.fraction", 0.8) \
  .config("spark.executor.memory", MAX_MEMORY) \
  .config("spark.driver.memory", MAX_MEMORY)\
  .config("spark.memory.offHeap.enabled",'true')\
  .config("spark.memory.offHeap.size",MAX_MEMORY)\
  .getOrCreate()

sc = spark.sparkContext

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import StringIndexerModel, VectorAssembler
from pyspark.ml.regression import GBTRegressor, RandomForestRegressor
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
# Import only the function that is used; a star import would shadow Python built-ins
from pyspark.sql.functions import log

print(sc.version)

3.0.1


# Question 0 (0 pts)
Please provide the following data so we can easily correlate your notebook with the grade book:
- Your Name: Dingyu Sun
- Your github user name: DingyuSun
- Your SU email address: dsun11@syr.edu

Your grade for grid search problems in this assignment will be determined in part on level of effort and your model performance results as compared to other students in the class.

# Question 1 (10 pts)
Read the diamonds.csv file into a spark data frame named `diamonds_df`.  Perform feature engineering as needed for training decision trees.  Name the new data frame diamonds_df_xformed.

In [4]:
# your code here
diamonds_df = spark.read.format("csv").option("header", "true").load("diamonds.csv")

# Cast every column explicitly because the CSV reader loads all values as strings
diamonds_df = diamonds_df.selectExpr("cast(price as Float) price",
                                     "cast(carat as Float) carat",
                                     "cast(cut as String) cut",
                                     "cast(color as String) color",
                                     "cast(clarity as String) clarity",
                                     "cast(depth as Float) depth",
                                     "cast(table as Integer) table",
                                     "cast(x as Float) x",
                                     "cast(y as Float) y",
                                     "cast(z as Float) z")


In [5]:
# Map each ordered categorical column to a numeric index (worst grade = 0)
trans1 = StringIndexerModel.from_labels(['Fair', 'Good', 'Very Good','Premium', 'Ideal'],inputCol="cut",outputCol="cut_idx")
trans2 = StringIndexerModel.from_labels(['J', 'I', 'H', 'G', 'F', 'E', 'D'],inputCol="color",outputCol="color_idx")
trans3 = StringIndexerModel.from_labels(['I1', 'SI2', 'SI1','VS2','VS1', 'VVS2', 'VVS1', 'IF'],inputCol="clarity",outputCol="clarity_idx")

transformationPipeline = Pipeline().setStages([trans1, trans2, trans3])

fittedPipeline = transformationPipeline.fit(diamonds_df)

transformedTraining = fittedPipeline.transform(diamonds_df)
transformedTraining = transformedTraining.withColumn("log_price", log("price"))
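
The encoding in the cell above can be reproduced in plain Python to see what the indexers and the log transform do to a single row. This is only a sketch of the idea, not the Spark implementation. The raw values (`Ideal`, `E`, `SI2`, price 326) are inferred from the first row of the transformed output, so treat them as an assumption, and `ordinal_index` is a hypothetical helper.

```python
import math

# Label orders copied from the StringIndexerModel.from_labels calls above
CUT_ORDER = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
COLOR_ORDER = ["J", "I", "H", "G", "F", "E", "D"]
CLARITY_ORDER = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]


def ordinal_index(label: str, ordered_labels: list) -> float:
    """Return the position of a label in its worst-to-best ordering."""
    return float(ordered_labels.index(label))


# Assumed raw first row (inferred from the transformed output, not read from the CSV)
row = {"cut": "Ideal", "color": "E", "clarity": "SI2", "price": 326.0}

encoded = {
    "cut": ordinal_index(row["cut"], CUT_ORDER),
    "color": ordinal_index(row["color"], COLOR_ORDER),
    "clarity": ordinal_index(row["clarity"], CLARITY_ORDER),
    "price": math.log(row["price"]),  # natural log, as in log("price")
}
print(encoded)
```

Because the label lists run from worst to best, a larger index always means a better grade, which lets a tree split the category with a simple threshold-style rule.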

In [6]:
diamonds_df_xformed = transformedTraining.drop(*['cut','color','clarity','price'])
diamonds_df_xformed = diamonds_df_xformed.withColumnRenamed("cut_idx","cut").withColumnRenamed("color_idx","color").withColumnRenamed("clarity_idx","clarity")\
.withColumnRenamed("log_price","price")
diamonds_df_xformed.show()

+-----+-----+-----+----+----+----+---+-----+-------+------------------+
|carat|depth|table|   x|   y|   z|cut|color|clarity|             price|
+-----+-----+-----+----+----+----+---+-----+-------+------------------+
| 0.23| 61.5|   55|3.95|3.98|2.43|4.0|  5.0|    1.0| 5.786897381366708|
| 0.21| 59.8|   61|3.89|3.84|2.31|3.0|  5.0|    2.0| 5.786897381366708|
| 0.23| 56.9|   65|4.05|4.07|2.31|1.0|  5.0|    4.0|5.7899601708972535|
| 0.29| 62.4|   58| 4.2|4.23|2.63|3.0|  1.0|    3.0|   5.8111409929767|
| 0.31| 63.3|   58|4.34|4.35|2.75|1.0|  0.0|    1.0| 5.814130531825066|
| 0.24| 62.8|   57|3.94|3.96|2.48|2.0|  0.0|    5.0| 5.817111159963204|
| 0.24| 62.3|   57|3.95|3.98|2.47|2.0|  1.0|    6.0| 5.817111159963204|
| 0.26| 61.9|   55|4.07|4.11|2.53|2.0|  2.0|    2.0| 5.820082930352362|
| 0.22| 65.1|   61|3.87|3.78|2.49|0.0|  5.0|    3.0| 5.820082930352362|
| 0.23| 59.4|   61| 4.0|4.05|2.39|2.0|  2.0|    4.0| 5.823045895483019|
|  0.3| 64.0|   55|4.25|4.28|2.73|1.0|  0.0|    2.0|  5.82600010

In [7]:
# Every column except the label is used as a feature
feature_list = [c for c in diamonds_df_xformed.columns if c != 'price']

In [8]:
# Grading Cell - do not modify
display(diamonds_df_xformed.toPandas().head())

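|   | carat | depth | table | x | y | z | cut | color | clarity | price |
|---|-------|-------|-------|---|---|---|-----|-------|---------|-------|
| 0 | 0.23 | 61.5 | 55 | 3.95 | 3.98 | 2.43 | 4.0 | 5.0 | 1.0 | 5.786897 |
| 1 | 0.21 | 59.799999 | 61 | 3.89 | 3.84 | 2.31 | 3.0 | 5.0 | 2.0 | 5.786897 |
| 2 | 0.23 | 56.900002 | 65 | 4.05 | 4.07 | 2.31 | 1.0 | 5.0 | 4.0 | 5.78996 |
| 3 | 0.29 | 62.400002 | 58 | 4.2 | 4.23 | 2.63 | 3.0 | 1.0 | 3.0 | 5.811141 |
| 4 | 0.31 | 63.299999 | 58 | 4.34 | 4.35 | 2.75 | 1.0 | 0.0 | 1.0 | 5.814131 |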


##### Grading Feedback Cell

The following questions will create a random forest regressor model, train the model using a grid search, and use the model for inference.  The goal is to see if we can improve upon the linear regression score from homework 3. You can find the spark documentation for the random forest regressor [here](https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-regression).

# Question 2 (20 pts)
Create and train your random forest regressor model using a grid search in the cell below.  You are free to use K-Fold Cross validation if you wish.  Your grid search must be entirely encapsulated in the `if enable_grid_search` if statement.  The `enable_grid_search` Boolean is defined in a grading cell above.  You will disable the grid search before you submit by setting enable_grid_search to false.  Setting enable_grid_search to false should not result in a runtime error.  You will not receive full credit if any part of your grid search is outside of the if statement or if runtime errors result from setting the `enable_grid_search` variable to false.

In [9]:
# your code here
if enable_grid_search:
  training_df, testing_df = diamonds_df_xformed.randomSplit([0.6, 0.4], seed=7)
  va = VectorAssembler().setInputCols(feature_list).setOutputCol('features')
  rf = RandomForestRegressor(labelCol="price",featuresCol="features")

  rf_pipeline = Pipeline(stages=[va,rf])
  
  paramGrid = ParamGridBuilder()\
      .addGrid(rf.numTrees, [25, 30])\
      .addGrid(rf.maxDepth, [10, 15])\
      .build()

  crossval = CrossValidator(estimator=rf_pipeline,
                            estimatorParamMaps=paramGrid,
                            evaluator=RegressionEvaluator(labelCol="price"),
                            numFolds=3)
  
  cvModel = crossval.fit(training_df)
  predictions = cvModel.transform(testing_df)

  bestPipeline = cvModel.bestModel
  bestModel = bestPipeline.stages[1]
  print('numTrees - ', bestModel.getOrDefault('numTrees'))
  print('maxDepth - ', bestModel.getOrDefault('maxDepth'))
  evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="mse")
  mse = evaluator.evaluate(predictions)
  print(mse)
  
  pass
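
To get a feel for the cost of the grid search above, the parameter grid can be enumerated in plain Python. This is a rough sketch, assuming that cross validation fits one model per parameter combination per fold and then refits the best combination once on the full training set; the variable names are illustrative.

```python
from itertools import product

# Same values as the ParamGridBuilder grid above
param_grid = {"numTrees": [25, 30], "maxDepth": [10, 15]}
num_folds = 3

# Cartesian product of all parameter values, one dict per combination
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

cv_fits = len(combos) * num_folds   # one fit per combination per fold
total_fits = cv_fits + 1            # plus one final refit of the best combination

for combo in combos:
    print(combo)
print("model fits:", total_fits)
```

Each extra value added to the grid multiplies the number of fits, which is why the search is kept behind the `enable_grid_search` flag.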

##### Grading Feedback Cell

# Question 3 (20 pts)
Create a pipeline named `best_pipe` that hard codes the tuning parameters from the best model found by the grid search in question 2 above.  Train and test best_pipe.  Do not use k-fold cross validation in question 3.  Clearly print the resulting train and test MSE for best_pipe so it's easy for the graders to see your resulting MSEs.

In [10]:
# Your code here
training_df, testing_df = diamonds_df_xformed.randomSplit([0.6, 0.4], seed=7)
va = VectorAssembler().setInputCols(feature_list).setOutputCol('features')
rf = RandomForestRegressor(labelCol="price",featuresCol="features",numTrees= 30 , maxDepth= 15)
best_pipe = Pipeline(stages=[va,rf])

model = best_pipe.fit(training_df)
predictions = model.transform(testing_df)
evaluator = RegressionEvaluator(
    labelCol="price", predictionCol="prediction", metricName="mse")
mse_rf = evaluator.evaluate(predictions)
print("Mean Squared Error (MSE) on test data = %g" % mse_rf)
predictions2 = model.transform(training_df)
mse_rf2 = evaluator.evaluate(predictions2)
print("Mean Squared Error (MSE) on train data = %g" % mse_rf2)

Mean Squared Error (MSE) on test data = 0.0111432
Mean Squared Error (MSE) on train data = 0.0052335
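
The evaluator above is configured with `metricName="mse"`, so the printed numbers are mean squared errors, not root mean squared errors. The sketch below shows the relationship between the two in plain Python; the helper names `mse` and `rmse` are illustrative and are not Spark functions. Since the label column is the natural log of price, both metrics are in log-price units.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between labels and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the MSE, expressed in the same units as the label."""
    return math.sqrt(mse(y_true, y_pred))


# Tiny worked example
toy_mse = mse([1.0, 2.0], [1.0, 4.0])
toy_rmse = rmse([1.0, 2.0], [1.0, 4.0])

# Converting the test MSE printed above into an RMSE
test_mse = 0.0111432
test_rmse = math.sqrt(test_mse)
print(toy_mse, round(toy_rmse, 4), round(test_rmse, 4))
```

An RMSE of about 0.106 in log-price units corresponds very roughly to a typical prediction error of around 11% of the price, under the usual small-error approximation.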


##### Grading Feedback Cell

# Question 4 (20 pts)
Use your best_pipe pipeline in question 3 for inference.  Create a pandas data frame named `rf_feature_importance` which contains 2 columns: `feature`, and `importance`.  Load the feature column with the feature name and the importance column with the feature importance score as determined by the random forest model. Sort the feature importances from high to low such that the most important feature is in the first row of the data frame.

In [11]:
# your code here
rf_model = model.stages[-1]
rf_feature_importance = pd.DataFrame(list(zip(feature_list, rf_model.featureImportances.toArray())),
                                     columns = ['feature', 'importance']).sort_values('importance',ascending=False)

In [12]:
print(feature_list)

['carat', 'depth', 'table', 'x', 'y', 'z', 'cut', 'color', 'clarity']


In [13]:
# grading cell - do not modify
display(rf_feature_importance)

|   | feature | importance |
|---|---------|------------|
| 4 | y | 0.411559 |
| 0 | carat | 0.287167 |
| 5 | z | 0.126911 |
| 3 | x | 0.123147 |
| 8 | clarity | 0.029337 |
| 7 | color | 0.014071 |
| 6 | cut | 0.003255 |
| 1 | depth | 0.00263 |
| 2 | table | 0.001924 |
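
The ranking above can be sanity checked in plain Python. The sketch below copies the importance scores from the output (in `feature_list` order), sorts them from high to low, and checks two things: that the scores sum to approximately 1, and how much of the total belongs to the size-related features. The grouping of carat, x, y, and z as "size" features is an interpretation made here for illustration.

```python
feature_list = ["carat", "depth", "table", "x", "y", "z", "cut", "color", "clarity"]
# Importance scores copied from the output above, in feature_list order
importances = [0.287167, 0.00263, 0.001924, 0.123147, 0.411559, 0.126911,
               0.003255, 0.014071, 0.029337]

# Pair each feature with its score and sort from most to least important
ranked = sorted(zip(feature_list, importances), key=lambda pair: pair[1], reverse=True)

total = sum(importances)
size_share = sum(score for name, score in ranked if name in {"carat", "x", "y", "z"})

for name, score in ranked:
    print(f"{name:8s} {score:.6f}")
print(f"total = {total:.4f}, carat/x/y/z share = {size_share:.4f}")
```

In this run the four size-related features carry roughly 95% of the total importance, while cut, depth, and table contribute very little.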


##### Grading Feedback Cell

# Question 5 (20 pts)
Write code to print the decision logic for any of the trees in the forest from the best_pipe pipeline.  Copy the printed decision text to the tree printout markdown cell below and retain the same formatting and indentation as the code printout so it's easy for the graders to view the data.  You need to double click the "Your Decision Tree Print Out Here" markdown cell and paste your output inside the two sets of triple quotes. The triple quotes are jupyter markdown indicating you want to present code.  Essentially, replace the text inside the triple quotes with your tree printout.  Solutions that do not maintain readable formatting will not receive full credit.

Add comments to the markdown cell below describing how the root node is split:  Describe 2 things in the markdown cell.  1) What specific predictor variable is being split and what is the value that determines the left / right split.  2) We need you to paste the tree decision logic output from your run in the markdown cell because the top level split may change from run to run.  If the graders run your notebook, the top level split for the tree may be different than the top level split from when you made the run.  Describe why the top level predictor changes from run to run.


In [14]:
# your code here
training_df, testing_df = diamonds_df_xformed.randomSplit([0.6, 0.4], seed=7)
va = VectorAssembler().setInputCols(feature_list).setOutputCol('features')
rf = RandomForestRegressor(labelCol="price",featuresCol="features")
best_pipe = Pipeline(stages=[va,rf])
model = best_pipe.fit(training_df)
rf_model = model.stages[-1]
len(rf_model.trees)
print(rf_model.trees[0].toDebugString)

DecisionTreeRegressionModel: uid=dtr_532870965cdb, depth=5, numNodes=63, numFeatures=9
  If (feature 0 <= 0.6949999928474426)
   If (feature 3 <= 4.855000019073486)
    If (feature 5 <= 2.834999918937683)
     If (feature 8 in {0.0,1.0,2.0,3.0,4.0,5.0})
      If (feature 3 <= 4.305000066757202)
       Predict: 6.300886494284585
      Else (feature 3 > 4.305000066757202)
       Predict: 6.5323148510684526
     Else (feature 8 not in {0.0,1.0,2.0,3.0,4.0,5.0})
      If (feature 0 <= 0.29500000178813934)
       Predict: 6.367088331952799
      Else (feature 0 > 0.29500000178813934)
       Predict: 6.776402149873627
    Else (feature 5 > 2.834999918937683)
     If (feature 8 in {0.0,1.0,2.0,3.0})
      If (feature 3 <= 4.704999923706055)
       Predict: 6.5787314077839065
      Else (feature 3 > 4.704999923706055)
       Predict: 6.719475326723084
     Else (feature 8 not in {0.0,1.0,2.0,3.0})
      If (feature 7 in {0.0,1.0,2.0})
       Predict: 6.770476594042477
      Else (feature 7 not



```
DecisionTreeRegressionModel: uid=dtr_1d5cf945f62d, depth=5, numNodes=63, numFeatures=9
  If (feature 4 <= 5.694999933242798)
   If (feature 5 <= 2.9950000047683716)
    If (feature 5 <= 2.834999918937683)
     If (feature 5 <= 2.625)
      If (feature 4 <= 4.255000114440918)
       Predict: 6.26336079504394
      Else (feature 4 > 4.255000114440918)
       Predict: 6.457152306446421
     Else (feature 5 > 2.625)
      If (feature 3 <= 4.3450000286102295)
       Predict: 6.4652104428683606
      Else (feature 3 > 4.3450000286102295)
       Predict: 6.603775459504828
    Else (feature 5 > 2.834999918937683)
     If (feature 3 <= 4.694999933242798)
      If (feature 7 in {0.0,1.0,2.0})
       Predict: 6.559656376328727
      Else (feature 7 not in {0.0,1.0,2.0})
       Predict: 6.75874519838659
     Else (feature 3 > 4.694999933242798)
      If (feature 7 in {0.0,1.0,2.0})
       Predict: 6.727517527584509
      Else (feature 7 not in {0.0,1.0,2.0})
       Predict: 6.904558071067402
   Else (feature 5 > 2.9950000047683716)
    If (feature 0 <= 0.4950000047683716)
     If (feature 4 <= 4.855000019073486)
      If (feature 6 in {0.0,1.0,2.0})
       Predict: 6.7531461889478335
      Else (feature 6 not in {0.0,1.0,2.0})
       Predict: 6.894183089754366
     Else (feature 4 > 4.855000019073486)
      If (feature 7 in {0.0,1.0,2.0,4.0})
       Predict: 6.869828323672314
      Else (feature 7 not in {0.0,1.0,2.0,4.0})
       Predict: 7.095892006854079
    Else (feature 0 > 0.4950000047683716)
     If (feature 3 <= 5.545000076293945)
      If (feature 8 in {0.0,1.0,2.0})
       Predict: 7.201574360365427
      Else (feature 8 not in {0.0,1.0,2.0})
       Predict: 7.516769795222332
     Else (feature 3 > 5.545000076293945)
      If (feature 1 <= 66.14999771118164)
       Predict: 7.744220811779731
      Else (feature 1 > 66.14999771118164)
       Predict: 7.186405444034211
  Else (feature 4 > 5.694999933242798)
   If (feature 3 <= 6.855000019073486)
    If (feature 4 <= 6.164999961853027)
     If (feature 3 <= 5.994999885559082)
      If (feature 5 <= 4.144999980926514)
       Predict: 7.919023654796459
      Else (feature 5 > 4.144999980926514)
       Predict: 6.148468295917458
     Else (feature 3 > 5.994999885559082)
      If (feature 5 <= 3.7050000429153442)
       Predict: 7.925798217945129
      Else (feature 5 > 3.7050000429153442)
       Predict: 8.20488837347516
    Else (feature 4 > 6.164999961853027)
     If (feature 8 in {0.0,1.0,2.0})
      If (feature 7 in {0.0,1.0})
       Predict: 8.314590112323133
      Else (feature 7 not in {0.0,1.0})
       Predict: 8.431009468865863
     Else (feature 8 not in {0.0,1.0,2.0})
      If (feature 8 in {3.0,4.0})
       Predict: 8.711271673279022
      Else (feature 8 not in {3.0,4.0})
       Predict: 9.087604846422645
   Else (feature 3 > 6.855000019073486)
    If (feature 3 <= 7.325000047683716)
     If (feature 8 in {0.0,1.0,2.0})
      If (feature 5 <= 4.3450000286102295)
       Predict: 8.646650593784521
      Else (feature 5 > 4.3450000286102295)
       Predict: 8.919184896721251
     Else (feature 8 not in {0.0,1.0,2.0})
      If (feature 3 <= 7.0350000858306885)
       Predict: 9.036203943260112
      Else (feature 3 > 7.0350000858306885)
       Predict: 9.235303892268535
    Else (feature 3 > 7.325000047683716)
     If (feature 5 <= 4.625)
      If (feature 8 in {0.0,1.0})
       Predict: 8.990546586516505
      Else (feature 8 not in {0.0,1.0})
       Predict: 9.327282325328424
     Else (feature 5 > 4.625)
      If (feature 8 in {0.0})
       Predict: 8.959381087630936
      Else (feature 8 not in {0.0})
       Predict: 9.531196597710816
```
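
The debug string refers to predictors by position (`feature 4`) and not by name. A small helper can substitute the column names from `feature_list` so the printout is easier to read. This is a sketch with a hypothetical helper name, `name_features`; it only rewrites text and does not touch the model.

```python
import re

feature_list = ["carat", "depth", "table", "x", "y", "z", "cut", "color", "clarity"]


def name_features(debug_text: str, names: list) -> str:
    """Replace every 'feature N' in the tree printout with the N-th column name."""
    return re.sub(r"feature (\d+)", lambda m: names[int(m.group(1))], debug_text)


root_line = "If (feature 4 <= 5.694999933242798)"
leaf_line = "Else (feature 8 not in {0.0})"
print(name_features(root_line, feature_list))
print(name_features(leaf_line, feature_list))
```

In the notebook, the same helper could be applied to `rf_model.trees[0].toDebugString` before printing.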



### Explain

Your explanation here:

1. What specific predictor variable is being split and what is the value that determines the left / right split. 

- During training, each candidate split is scored by how much it reduces impurity (variance, for regression). The split chosen at the root is the one with the largest reduction among the features considered at that node. In the tree printed above, the root splits on feature 4, which is the 'y' variable (index 4 in `feature_list`).
- The split value is 5.694999933242798 (about 5.695). Rows with y <= 5.695 go to the left branch; rows with y > 5.695 go to the right branch.

2. Logic 

Logic One
```
  If (feature 4 <= 5.694999933242798)
   If (feature 5 <= 2.9950000047683716)
    If (feature 5 <= 2.834999918937683)
     If (feature 5 <= 2.625)
      If (feature 4 <= 4.255000114440918)
       Predict: 6.26336079504394
```
Logic Two
```
      Else (feature 4 > 4.255000114440918)
       Predict: 6.457152306446421
```
Logic Three
```
     Else (feature 5 > 2.625)
      If (feature 3 <= 4.3450000286102295)
       Predict: 6.4652104428683606
```
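
The question also asks why the root split can change from run to run. A random forest trains each tree on a random sample of the rows and considers only a random subset of the features at each node, so the best available root split differs between trees and runs; the two printouts above, for example, have roots on feature 0 and feature 4. The sketch below simulates only the feature-subset part, under the assumption that one third of the features (rounded up) are candidates at each node, which is assumed here to match Spark's `onethird` default for regression forests. It is a toy simulation with Python's `random` module, not Spark's sampling code.

```python
import math
import random

feature_list = ["carat", "depth", "table", "x", "y", "z", "cut", "color", "clarity"]

# Assumed candidate count per node: one third of the features, rounded up
subset_size = math.ceil(len(feature_list) / 3)


def root_candidates(seed: int) -> list:
    """Features that a tree built with this seed may consider at its root."""
    rng = random.Random(seed)
    return sorted(rng.sample(feature_list, subset_size))


# Probability that one given feature (for example y) is among the root candidates
p_feature_at_root = (math.comb(len(feature_list) - 1, subset_size - 1)
                     / math.comb(len(feature_list), subset_size))

for seed in range(3):
    print(seed, root_candidates(seed))
print(round(p_feature_at_root, 3))
```

Under this assumption, any single feature is available at the root in only about one third of the trees, so even the strongest predictor cannot be the root split every time.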

# Question 6 (5 pts)
Describe if the random forest model MSE score was better or worse than the MSE score from your best model in homework 3.  Include both scores in your description.

Your improvement explanation here:  


*   MSE from HW3 (linear regression): 0.0355
*   Test MSE of the random forest with the grid-search parameters: 0.0111

The random forest has the lower MSE, so it performs better than the linear regression model from homework 3.
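
The size of the improvement can be quantified from the two scores. The sketch below uses the homework 3 linear regression MSE and the random forest test MSE reported in this notebook; it assumes both were computed on the same log-price target, which is what makes the comparison meaningful.

```python
lr_mse = 0.035473544018858036  # linear regression test MSE from homework 3
rf_mse = 0.0111432             # random forest test MSE printed in question 3

# Fraction of the linear regression error removed by the random forest
relative_reduction = (lr_mse - rf_mse) / lr_mse
# How many times larger the linear regression error is
ratio = lr_mse / rf_mse

print(f"relative reduction: {relative_reduction:.1%}")
print(f"LR MSE is {ratio:.2f}x the RF MSE")
```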





##### Grading Feedback Cell

# Question 7 (5 pts)
Set the `enable_grid_search` Boolean variable to False in the grading cell at the top of this notebook.  Perform a __Runtime -> factory reset__, __Runtime -> Run all__ test to verify there are no runtime errors.  Leave the `enable_grid_search` variable set to False and turn in your assignment.  This is the kind of thing you should be doing before you turn in every assignment. Remember this for future classes and when you get a job in industry.  This question will be graded as all or nothing.  You either set the Boolean correctly or not.  Additional points will be deducted elsewhere for runtime errors.
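
The flag only works if nothing outside the guarded block depends on variables created inside it. One way to structure this is sketched below in plain Python: the expensive search runs only when the flag is set, and the rest of the notebook relies on hard-coded parameters. The scoring function `toy_score` is a made-up stand-in for a cross-validated error and `choose_params` is a hypothetical helper, so neither reflects the real models.

```python
enable_grid_search = False

# Parameters hard coded from an earlier grid search run
HARD_CODED_BEST = {"numTrees": 30, "maxDepth": 15}


def toy_score(params: dict) -> float:
    """Made-up stand-in for a cross-validated error; lower is better."""
    return 1.0 / (params["numTrees"] * params["maxDepth"])


def choose_params(run_search: bool) -> dict:
    """Run the (toy) search only when asked; otherwise use the hard-coded values."""
    if run_search:
        grid = [{"numTrees": n, "maxDepth": d} for n in (25, 30) for d in (10, 15)]
        return min(grid, key=toy_score)
    return dict(HARD_CODED_BEST)


best_params = choose_params(enable_grid_search)
print(best_params)
```

With the flag set to False, the code path never touches search-only variables, so a full Run All completes without errors.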

# Extra Credit (10 pts)
This homework was intended to take less time to complete and be about half the effort of previous assignments.  This doesn't allow us to explore GBT or deep learning.  

For extra credit, train a GBT or Deep Learning model using a grid search.  Protect the grid search inside the if enable_grid_search statement in the first code cell below.  You are free to use K-Fold cross validation if you wish.  The spark documentation for GBM can be found [here](https://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-tree-classifier).  The spark documentation for deep learning can be found [here](https://spark.apache.org/docs/latest/ml-classification-regression.html#multilayer-perceptron-classifier)

In the second code cell below, hard code the best model parameters as determined by the grid search in a new pipeline named `best_pipe_2`.  Train and test `best_pipe_2` and save your resulting test MSE in a variable.  Do not use K-Fold cross validation when training best_pipe_2.  

In the third code cell below, create a pandas data frame named `compare_1_df` which contains 2 columns: Model and MSE.  Populate the Model column with model names: LR, RF, GBT or DL.  Populate the score column with the linear regression, random forest, and gradient boosted tree or deep learning test MSE scores. The linear regression score is from homework 3. The random forest score is from the random forest model above.  The GBT or Deep Learning score is from this extra credit problem.  Sort compare_1_df such that the best score is in the first row of the data frame. 

To get full credit, your GBT or deep learning solution should produce a score as good or better than the random forest score above.  In addition, the same rules as above apply where all of your grid search code shall be protected by the enable_grid_search Boolean.  Code that produces a runtime error when enable_grid_search is set to False will get 0 credit.

In [15]:
# Your GBT / Deep Learning grid search code here
if enable_grid_search:
  training_df, testing_df = diamonds_df_xformed.randomSplit([0.6, 0.4], seed=7)

  va = VectorAssembler().setInputCols(feature_list).setOutputCol('features')
  
  gbt = GBTRegressor(featuresCol='features',labelCol='price')

  gbt_pipeline = Pipeline(stages=[va,gbt])
  
  paramGrid = ParamGridBuilder().addGrid(gbt.maxDepth, [10,13]).addGrid(gbt.maxBins, [30,35]).build()

  crossval = CrossValidator(estimator = gbt_pipeline,
                            estimatorParamMaps = paramGrid,
                            evaluator = RegressionEvaluator(labelCol="price"),
                            numFolds=3)
  
  cvModel = crossval.fit(training_df)
  predictions = cvModel.transform(testing_df)

  bestPipeline = cvModel.bestModel
  bestModel = bestPipeline.stages[1]
  print('maxBins - ', bestModel.getOrDefault('maxBins'))
  print('maxDepth - ', bestModel.getOrDefault('maxDepth'))
  evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="mse")
  mse = evaluator.evaluate(predictions)
  print("Mean Squared Error (MSE) on test data = %g" % mse)

  pass

In [16]:
# your hard coded parameter best model code here
training_df, testing_df = diamonds_df_xformed.randomSplit([0.6, 0.4], seed=7)

va = VectorAssembler().setInputCols(feature_list).setOutputCol('features')
gbt = GBTRegressor(featuresCol='features',labelCol='price', maxDepth = 10, maxBins = 35)
best_pipe_2 = Pipeline(stages=[va,gbt])

gbt_model = best_pipe_2.fit(training_df)
predictions = gbt_model.transform(testing_df)

evaluator = RegressionEvaluator(
    labelCol="price", predictionCol="prediction", metricName="mse")

mse_gbt = evaluator.evaluate(predictions)
print("Mean Squared Error (MSE) on test data = %g" % mse_gbt)

Mean Squared Error (MSE) on test data = 0.0118628


In [17]:
# Create compare_1_df
model = ['LR','RF','GBT']
score = [0.035473544018858036,mse_rf,mse_gbt]
compare_1_df = pd.DataFrame(list(zip(model, score)),columns = ['Model', 'MSE']).sort_values('MSE')

In [18]:
# Grading cell do not modify
display(compare_1_df)

|   | Model | MSE |
|---|-------|-----|
| 1 | RF | 0.011143 |
| 2 | GBT | 0.011863 |
| 0 | LR | 0.035474 |
