##### This notebook contains the work for Step 5 of the Data Science Method. It also contains the prep work completed in Step 4.

**The Data Science Method**  


1.   Problem Identification 

2.   Data Wrangling 
  * Data Collection 
   * Data Organization
  * Data Definition 
  * Data Cleaning
3.   Exploratory Data Analysis
 * Build data profile tables and plots
        - Outliers & Anomalies
 * Explore data relationships
 * Identification and creation of features

4.   Pre-processing and Training Data Development
  * Create dummy or indicator features for categorical variables
  * Standardize the magnitude of numeric features
  * Split into testing and training datasets
  * Apply scaler to the testing set

<b>5.   Modeling 
  * Fit Models with Training Data Set
  * Review Model Outcomes — Iterate over additional models as needed.
  * Identify the Final Model</b>

6.   Documentation
  * Review the Results
  * Present and share your findings - storytelling
  * Finalize Code 
  * Finalize Documentation

### STEP 4: Pre-processing and Training Data Development

In [1]:
#load python packages
import os
import pandas as pd
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
from scipy.stats import chi2_contingency
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from pprint import pprint

from sklearn import svm
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import SVR
from sklearn.metrics import r2_score, explained_variance_score,mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
from timeit import default_timer as timer
from sklearn.tree import export_graphviz
from sklearn import tree
from boruta import BorutaPy
from sklearn.model_selection import GridSearchCV

In [2]:
# set options
pd.set_option('display.max_rows', 1500)

In [3]:
# load the data saved from step 3
df = pd.read_csv(os.path.join('data', 'step3_output.csv'))
df.head()


Unnamed: 0,SCHEDNUM_x,RECEPTION_NUM,INSTRUMENT,SALE_YEAR,SALE_MONTHDAY,RECEPTION_DATE,SALE_PRICE,GRANTOR,GRANTEE,CLASS,...,UNITS,ASMT_APPR_LAND,TOTAL_VALUE,ASDLAND,ASSESS_VALUE,ASMT_TAXABLE,ASMT_EXEMPT_AMT,NBHD_1_CN_y,LEGL_DESCRIPTION,IMPROVE_VALUE
0,14101001000,2008138043,WD,2008,703,20081008,10.0,"ATKINSON,RUSSELL",DREAM BUILDERS LLC,R,...,1,103000,530200,7365,37910,37920,0,N GREEN VALLEY,GREEN VALLEY RANCH FLG #36 B1 L1,427200
1,14101001000,2009074518,WD,2009,605,20090615,299000.0,"ATKINSON,RUSSELL","PADBURY,CHRISTOPHER R",R,...,1,103000,530200,7365,37910,37920,0,N GREEN VALLEY,GREEN VALLEY RANCH FLG #36 B1 L1,427200
2,14101001000,2015157653,WD,2015,1102,20151109,415000.0,"PADBURY,CHRISTOPHER R","MACIEL,HORACIO PEREZ",R,...,1,103000,530200,7365,37910,37920,0,N GREEN VALLEY,GREEN VALLEY RANCH FLG #36 B1 L1,427200
3,14101001000,2009002129,WD,2008,1024,20090108,10.0,DREAM BUILDERS LLC,"ATKINSON,RUSSELL",R,...,1,103000,530200,7365,37910,37920,0,N GREEN VALLEY,GREEN VALLEY RANCH FLG #36 B1 L1,427200
4,14101002000,2010094573,WD,2010,823,20100824,350000.0,"SHEARON,MARK H &","EFREM,TEWEDROS",R,...,1,90400,572600,6464,40941,40940,0,N GREEN VALLEY,GREEN VALLEY RANCH FLG #36 B1 L2,482200


## Listing of fields, arranged in groups with decision about how to handle fields

|Field|type|group|definition|notes|
|---|---|---|---|---|
|CCAGE_RM|float64|age|Remodel Year||
|CCYRBLT|float64|age|Year Built||
|RECEPTION_DATE|int64|date|Clerk & Recorder's Reception Date|drop|
|SALE_MONTHDAY|int64|date|Sale Month/Day||
|SALE_YEAR|int64|date|Sale Year||
|PIN|int64|id|Assessor's Property Identification Number|drop|
|RECEPTION_NUM|int64|id|Input Key|drop|
|SCHEDNUM_x|int64|id|Input Key|drop|
|CO_OWNER|object|ignore|Co-Owner|drop|
|MKT_CLUS|float64|loc|||
|NBHD_1_CN_x|object|loc|Neighborhood name - x|category|
|NBHD_1_CN_y|object|loc|Neighborhood name - y|drop|
|NBHD_1_x|int64|loc|Neighborhood code|ignore|
|SITE_DIR|object|loc|Site Street Direction|ignore|
|SITE_MODE|object|loc|Site Street Type|ignore|
|SITE_MORE|object|loc|Site Unit Number|ignore|
|SITE_NAME|object|loc|Site Street Name|ignore|
|SITE_NBR|int64|loc|Site Street Number|ignore|
|TAX_DIST|object|loc|Tax District|category, ignore?|
|ZONE10|object|loc|Zone|category, ignore?|
|CD|int64|other|Building Number|drop|
|CLASS|object|other|#N/A|drop|
|D_CLASS|int64|other|Property Use Class|category, ignore?|
|D_CLASS_CN_x|object|other|Property Use Class Definition - x|drop|
|D_CLASS_CN_y|object|other|Property Use Class Definition - y|drop|
|GRANTEE|object|other|??|drop|
|GRANTOR|object|other|??|drop|
|INSTRUMENT|object|other|??|drop|
|LEGL_DESCRIPTION|object|other|Description of the parcel per the deed |drop|
|PROP_CLASS|int64|other|Property Class(ASMT-PROP-CODE)|only select 1112|
|PROPERTY_CLASS|object|other|Property Class Description|drop|
|STYLE_CN|object|other|Architecture Style Code Definition|similar to STORY but does not agree|
|OWNER|object|owner|Owner|drop|
|OWNER_APT|object|owner|Street Mailing Unit Number|drop|
|OWNER_CITY|object|owner|Mailing City|drop|
|OWNER_DIR|object|owner|Street Mailing Direction|drop|
|OWNER_NUM|object|owner|Street Mailing Number|drop|
|OWNER_ST|object|owner|Street Mailing Street Name|drop|
|OWNER_STATE|object|owner|Mailing State|drop|
|OWNER_TYPE|object|owner|Street Mailing Type|drop|
|OWNER_ZIP|object|owner|Zip Code|drop|
|AREA_ABG|int64|size|Above Grade Improvement Area||
|BED_RMS|int64|size|Number of bedrooms above grade||
|BSMT_AREA|int64|size|Basement Square Footage||
|FBSMT_SQFT|int64|size|Finished Basement Area||
|FULL_B|float64|size|Total number of full baths||
|GRD_AREA|int64|size|Garden Level Square Footage||
|HLF_B|float64|size|Total number of half baths||
|LAND_SQFT|float64|size|Land Area||
|OFCARD|int64|size|Number of Buildings|drop|
|STORY|int64|size|Stories||
|UNITS|int64|size|Number of Units||
|ASDLAND|int64|value|Assessed Land Value|drop|
|ASMT_APPR_LAND|int64|value|Actual Land Value||
|ASMT_EXEMPT_AMT|int64|value|Exempt Amount|drop|
|ASMT_TAXABLE|int64|value|Taxable Amount|drop|
|ASSESS_VALUE|int64|value|Assessed Total Value|drop|
|IMPROVE_VALUE|int64|value|Calculated=Tot Val - Land||
|SALE_PRICE|float64|value|Sale Price||
|TOTAL_VALUE|int64|value|Actual Total Value|drop|


In [4]:
# only select PROP_CLASS = 1112 (Single Family Residential)
df = df[df.PROP_CLASS == 1112]

In [5]:
# instead of looking at the year the house was built (CCYRBLT), transform it to years since built
df = df.assign(AGE = df.SALE_YEAR-df.CCYRBLT)

In [6]:
# instead of looking at year of remodel (CCAGE_RM), transform it to years since remodel
df=df.assign(RM_AGE = df.SALE_YEAR-df.CCAGE_RM)
# if remodel year is after sale (RM_AGE will be negative), then reset RM_AGE to AGE
df.loc[df.RM_AGE < 0, 'RM_AGE'] = df.AGE
# set remodel age to home age if no remodel year is available
df.loc[df.CCAGE_RM == 0, 'RM_AGE'] = df.AGE

In [7]:
# show results of new columns
print(df.groupby(['SALE_YEAR','CCYRBLT','CCAGE_RM','AGE','RM_AGE']).size())

SALE_YEAR  CCYRBLT  CCAGE_RM  AGE    RM_AGE
2008       1876.0   0.0       132.0  132.0       1
           1880.0   0.0       128.0  128.0       1
                    2001.0    128.0  7.0         1
           1881.0   0.0       127.0  127.0       2
                    2012.0    127.0  127.0       1
                                              ... 
2020       2018.0   0.0       2.0    2.0        31
                    2018.0    2.0    2.0        14
                    2019.0    2.0    1.0         1
           2019.0   0.0       1.0    1.0       171
                    2019.0    1.0    1.0        18
Length: 24363, dtype: int64
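The remodel-age rule from the cells above can be restated row by row. The following is a minimal sketch in plain Python; the function name `remodel_age` is hypothetical and is not part of the notebook. The three example rows are taken from the grouped output above.

```python
def remodel_age(sale_year, year_built, remodel_year):
    """Years since remodel at sale time, falling back to the home's age."""
    age = sale_year - year_built
    if remodel_year == 0:
        # no remodel year recorded: use the age of the home
        return age
    rm_age = sale_year - remodel_year
    # a negative value means the remodel happened after the sale
    return age if rm_age < 0 else rm_age


# (SALE_YEAR, CCYRBLT, CCAGE_RM) rows from the output above
examples = [
    (2008, 1880, 2001),  # remodeled before the sale
    (2008, 1881, 2012),  # remodeled after the sale
    (2008, 1876, 0),     # no remodel recorded
]
for sale, built, remodel in examples:
    print(sale, built, remodel, "->", remodel_age(sale, built, remodel))
```

The three cases return 7, 127, and 132, which agree with the `RM_AGE` values shown for those rows.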


In [8]:
# drop columns that are not going to be used
df.drop(['RECEPTION_DATE', 'PIN', 'RECEPTION_NUM', 'SCHEDNUM_x', 'CO_OWNER', 'NBHD_1_CN_y', 'NBHD_1_x'
         , 'SITE_DIR', 'SITE_MODE', 'SITE_MORE', 'SITE_NAME', 'SITE_NBR', 'TAX_DIST', 'ZONE10', 'CD', 'CLASS'
         , 'D_CLASS', 'D_CLASS_CN_x', 'D_CLASS_CN_y', 'GRANTEE', 'GRANTOR', 'INSTRUMENT', 'LEGL_DESCRIPTION'
         , 'PROP_CLASS', 'PROPERTY_CLASS', 'STYLE_CN', 'OWNER', 'OWNER_APT', 'OWNER_CITY', 'OWNER_DIR', 'OWNER_NUM'
         , 'OWNER_ST', 'OWNER_STATE', 'OWNER_TYPE', 'OWNER_ZIP', 'OFCARD', 'ASDLAND', 'ASMT_EXEMPT_AMT'
         , 'ASMT_TAXABLE', 'ASSESS_VALUE', 'TOTAL_VALUE', 'CCYRBLT', 'CCAGE_RM'], axis=1, inplace=True)

In [9]:
# drop any rows with NaN values
df.dropna(inplace=True)

In [10]:
# change categorical values (NBHD_1_CN_x) to indicators
#df_inds = pd.get_dummies(data=df, columns=['NBHD_1_CN_x'],drop_first=True)

In [11]:
#df_inds.describe().T

In [12]:
# Create feature and target arrays
y = df['IMPROVE_VALUE'].to_numpy()  # 1-D NumPy array; Series.ravel() is deprecated in recent pandas
proposed = df.drop(['IMPROVE_VALUE','SALE_PRICE','NBHD_1_CN_x'], axis=1)

In [13]:
###initialize Boruta
forest = RandomForestRegressor(
   n_jobs = -1, 
   max_depth = 5
)
boruta = BorutaPy(
   estimator = forest, 
   n_estimators = 'auto',
   max_iter = 100 # number of trials to perform
)
### fit Boruta (it accepts np.array, not pd.DataFrame)
boruta.fit(np.array(proposed),np.array(y))
### print results
green_area = proposed.columns[boruta.support_].to_list()
blue_area = proposed.columns[boruta.support_weak_].to_list()
print('features in the green area:', green_area)
print('features in the blue area:', blue_area)

features in the green area: ['MKT_CLUS', 'LAND_SQFT', 'AREA_ABG', 'BSMT_AREA', 'FBSMT_SQFT', 'FULL_B', 'HLF_B', 'ASMT_APPR_LAND', 'AGE']
features in the blue area: ['RM_AGE']
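In BorutaPy, `support_` marks the confirmed features (the "green area") and `support_weak_` the tentative ones (the "blue area"). The idea behind the method is to compare each real feature against shuffled "shadow" copies that cannot carry signal. The sketch below shows a single round of that comparison on synthetic data using scikit-learn only. It is an illustration of the idea, not BorutaPy's implementation, which repeats the comparison over many iterations and applies a statistical test. All data and variable names here are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_rows = 500

# Synthetic data: columns 0 and 1 drive the target, column 2 is pure noise
X_demo = rng.normal(size=(n_rows, 3))
y_demo = 3.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=n_rows)

# Shadow features: each column shuffled independently, breaking any link to the target
X_shadow = np.column_stack(
    [rng.permutation(X_demo[:, j]) for j in range(X_demo.shape[1])]
)
X_aug = np.hstack([X_demo, X_shadow])

forest_demo = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=0)
forest_demo.fit(X_aug, y_demo)

importances = forest_demo.feature_importances_
real_importance = importances[:3]
shadow_max = importances[3:].max()

# A real feature scores a "hit" in this round if it beats the best shadow feature
hits = real_importance > shadow_max
print("real importances:", real_importance)
print("best shadow importance:", shadow_max)
print("hits this round:", hits)
```

The two informative columns should beat every shadow feature; whether the noise column does in a single round is a matter of chance, which is why the full algorithm accumulates hits over many rounds.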


In [14]:
X=proposed[['MKT_CLUS', 'LAND_SQFT', 'AREA_ABG', 'BSMT_AREA', 'FBSMT_SQFT', 'FULL_B', 'HLF_B', 'ASMT_APPR_LAND', 'AGE', 'RM_AGE']]
X.head()

Unnamed: 0,MKT_CLUS,LAND_SQFT,AREA_ABG,BSMT_AREA,FBSMT_SQFT,FULL_B,HLF_B,ASMT_APPR_LAND,AGE,RM_AGE
0,6.0,8622.0,4130,1771,0,5.0,0.0,103000,3.0,3.0
1,6.0,8622.0,4130,1771,0,5.0,0.0,103000,4.0,4.0
2,6.0,8622.0,4130,1771,0,5.0,0.0,103000,10.0,10.0
3,6.0,8622.0,4130,1771,0,5.0,0.0,103000,3.0,3.0
4,6.0,7232.0,3902,1540,1125,5.0,0.0,90400,5.0,5.0


In [15]:
# now scale the X data
scaler = StandardScaler().fit(X)
X_scaled=scaler.transform(X)

In [16]:
# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.2, random_state=42)
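The two cells above fit the scaler on all rows and then split, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to split first and compute the scaling statistics from the training rows only. The sketch below shows that order with NumPy on made-up data; the `*_demo` names are hypothetical and the numbers are not the notebook's.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical features: something like square footage and bath count
X_demo = rng.normal(loc=[2000.0, 3.0], scale=[500.0, 1.0], size=(100, 2))

# 80/20 split by shuffled row index
shuffled = rng.permutation(len(X_demo))
n_test = int(0.2 * len(X_demo))
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

# Scaling statistics come from the training rows only
train_mean = X_demo[train_idx].mean(axis=0)
train_std = X_demo[train_idx].std(axis=0)

# The same statistics are applied to both sets
X_train_demo = (X_demo[train_idx] - train_mean) / train_std
X_test_demo = (X_demo[test_idx] - train_mean) / train_std

print(X_train_demo.mean(axis=0), X_train_demo.std(axis=0))
print(X_test_demo.mean(axis=0), X_test_demo.std(axis=0))
```

With scikit-learn the same order is `train_test_split` first, then `StandardScaler().fit(X_train)`, then `transform` on both the training and test sets. The test set will not have exactly zero mean and unit variance, which is expected.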

## Fit Models with Training Data Set

### Test: Linear Regression, Random Forest, K Nearest Neighbors, and SVR

### Model 1: Linear Regression

In [17]:
lm = LinearRegression()
model1 = lm.fit(X_train,y_train)

In [18]:
y_pred1 =model1.predict(X_test)

In [19]:

# show model results 
print('explained variance score: ',explained_variance_score(y_test, y_pred1))
print('MAE: ',mean_absolute_error(y_test, y_pred1))
print('R2 Score: ',r2_score(y_test, y_pred1))
print('intercept: ',lm.intercept_)
pd.DataFrame(abs(lm.coef_),X.columns,columns=['Coeff']).sort_values(by=['Coeff'],ascending=False).head(10)

explained variance score:  0.6654955053359728
MAE:  98359.41404029226
R2 Score:  0.6654683936478317
intercept:  336830.24863234244


Unnamed: 0,Coeff
AREA_ABG,162743.543017
FBSMT_SQFT,53920.748444
FULL_B,24906.329517
LAND_SQFT,24161.851322
HLF_B,17959.88331
ASMT_APPR_LAND,15523.23988
MKT_CLUS,12523.028427
RM_AGE,9195.118552
BSMT_AREA,4464.154858
AGE,3685.329314
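Because the model was fit on standardized features, each coefficient above is the change in predicted `IMPROVE_VALUE` per one standard deviation of that feature, not per square foot or per bathroom. Dividing a coefficient by the feature's standard deviation converts it back to original units. The sketch below demonstrates this with NumPy least squares on made-up, noise-free data; the feature names and dollar amounts are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: the target is an exact linear function of two features
sqft = rng.uniform(800, 4000, size=200)
baths = rng.integers(1, 5, size=200).astype(float)
X_raw = np.column_stack([sqft, baths])
y_demo = 150.0 * sqft + 20000.0 * baths + 50000.0

# Standardize, then fit ordinary least squares with an intercept column
mu = X_raw.mean(axis=0)
sigma = X_raw.std(axis=0)
X_std = (X_raw - mu) / sigma
design = np.column_stack([np.ones(len(X_std)), X_std])
solution, *_ = np.linalg.lstsq(design, y_demo, rcond=None)

intercept = solution[0]
coef_per_std = solution[1:]          # change in y per one standard deviation
coef_per_unit = coef_per_std / sigma  # change in y per original unit

print("per standard deviation:", coef_per_std)
print("per original unit:", coef_per_unit)
print("intercept:", intercept, "target mean:", y_demo.mean())
```

In this sketch the per-unit coefficients recover the 150 and 20000 used to build the target, and the intercept equals the target mean because the standardized features have zero mean on the fitting data.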


### Model 2: Random Forest

In [20]:
rf = RandomForestRegressor(n_estimators=100, max_features='sqrt')
model2 = rf.fit(X_train,y_train)

In [21]:
y_pred2 =model2.predict(X_test)

In [22]:
# show model results
print('explained variance score: ',explained_variance_score(y_test, y_pred2))
print('MAE: ',mean_absolute_error(y_test, y_pred2))
print('R2 score is: ',r2_score(y_test, y_pred2))

explained variance score:  0.9158093603272732
MAE:  34067.51596992946
R2 score is:  0.9158081119840211


### Model 3: K Nearest Neighbors

In [23]:
kn = KNeighborsRegressor(n_neighbors=6)
model3 = kn.fit(X_train,y_train)

In [24]:
y_pred3 =model3.predict(X_test)

In [25]:
# show model results
print('explained variance score: ',explained_variance_score(y_test, y_pred3))
print('MAE: ',mean_absolute_error(y_test, y_pred3))
print('R2 score is: ',r2_score(y_test, y_pred3))

explained variance score:  0.8382759279871329
MAE:  51457.2233431092
R2 score is:  0.8382728691916164


### Model 4: SVR

In [26]:
sv = SVR(kernel='linear')
model4 = sv.fit(X_train,y_train)

In [27]:
# show model results
y_pred4 =model4.predict(X_test)

print('explained variance score: ',explained_variance_score(y_test, y_pred4))
print('MAE: ',mean_absolute_error(y_test, y_pred4))
print('R2 score is: ',r2_score(y_test, y_pred4))

explained variance score:  0.3966371787007875
MAE:  105983.11928309638
R2 score is:  0.37353467579361754


In [28]:
#Summarize results from all four models

print ( "                                 1        2        3        4")

print ( "explained variance score:     %.2f     %.2f     %.2f     %.2f " % (
    explained_variance_score(y_test, y_pred1),
    explained_variance_score(y_test, y_pred2),
    explained_variance_score(y_test, y_pred3),
    explained_variance_score(y_test, y_pred4),
))



print ( "     mean absolute error: %.2f %.2f %.2f %.2f " % (    
    mean_absolute_error(y_test, y_pred1),
    mean_absolute_error(y_test, y_pred2),
    mean_absolute_error(y_test, y_pred3),
    mean_absolute_error(y_test, y_pred4),
))


print ( "             R2 score is:     %.2f     %.2f     %.2f     %.2f " % (
    r2_score(y_test, y_pred1),
    r2_score(y_test, y_pred2),
    r2_score(y_test, y_pred3),
    r2_score(y_test, y_pred4),
))


                                 1        2        3        4
explained variance score:     0.67     0.92     0.84     0.40 
     mean absolute error: 98359.41 34067.52 51457.22 105983.12 
             R2 score is:     0.67     0.92     0.84     0.37 
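For three of the models the explained variance score and R2 agree to two decimals, but for SVR they differ (0.40 versus 0.37). The two metrics differ only in how they treat a constant offset in the residuals: explained variance ignores it, while R2 penalizes it. The sketch below writes the three metrics out by hand with NumPy on a small made-up example; the helper names are hypothetical and stand in for the scikit-learn functions used above.

```python
import numpy as np


def mae(y_true, y_pred):
    # mean absolute error: average size of the residuals
    return np.mean(np.abs(y_true - y_pred))


def r2(y_true, y_pred):
    # 1 - (residual sum of squares) / (total sum of squares)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def explained_variance(y_true, y_pred):
    # 1 - Var(residuals) / Var(y); a constant shift in residuals has no effect
    return 1.0 - np.var(y_true - y_pred) / np.var(y_true)


y_true_demo = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_demo = np.array([2.5, 0.0, 2.0, 8.0])
y_pred_biased = y_pred_demo + 1.0  # same predictions with a constant offset

for label, pred in [("original", y_pred_demo), ("biased", y_pred_biased)]:
    print(label,
          "MAE=%.3f" % mae(y_true_demo, pred),
          "R2=%.3f" % r2(y_true_demo, pred),
          "EV=%.3f" % explained_variance(y_true_demo, pred))
```

Adding the constant offset leaves the explained variance unchanged and lowers R2, so a gap between the two in the table suggests that the model's predictions are systematically shifted.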


## Tuning Hyperparameters of Random Forest model

In [29]:
# display parameters of current model
print('Parameters currently in use:\n')
pprint(rf.get_params())

Parameters currently in use:

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}


In [30]:
# Create the random grid
random_grid = {"max_features": ['auto', 'sqrt'],
               "max_depth": [1,10,20,30,40,50,60,70,80,90,100, None],
               "min_samples_leaf": [1,3,10],
               "min_samples_split": [2,5,10],
               "bootstrap": [True, False],
               "n_estimators": [10,100]}
pprint(random_grid)

{'bootstrap': [True, False],
 'max_depth': [1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
 'max_features': ['auto', 'sqrt'],
 'min_samples_leaf': [1, 3, 10],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [10, 100]}


In [31]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf2 = RandomForestRegressor()
# Random search of parameters, using 3-fold cross-validation;
# sample 50 parameter combinations (n_iter=50) and use all available cores
rf_random = RandomizedSearchCV(estimator = rf2, param_distributions = random_grid, n_iter = 50, cv = 3, verbose=10, random_state=42, n_jobs = -1)
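To see how much of the search space the random search covers, the number of possible combinations in `random_grid` can be counted directly. The short sketch below repeats the grid from the cell above and does the arithmetic with the standard library; the variable names `n_combinations` and `total_fits` are introduced here for illustration.

```python
from math import prod

random_grid = {"max_features": ['auto', 'sqrt'],
               "max_depth": [1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
               "min_samples_leaf": [1, 3, 10],
               "min_samples_split": [2, 5, 10],
               "bootstrap": [True, False],
               "n_estimators": [10, 100]}

# Every combination is one choice from each list, so the counts multiply
n_combinations = prod(len(values) for values in random_grid.values())

n_iter = 50  # combinations sampled by the random search
cv = 3       # folds per sampled combination
total_fits = n_iter * cv

print("possible combinations:", n_combinations)
print("sampled fraction: %.1f%%" % (100 * n_iter / n_combinations))
print("model fits:", total_fits)
```

The grid contains 864 combinations, so 50 sampled candidates with 3 folds each come to 150 fits, a small fraction of what an exhaustive search over the same grid would require.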

In [32]:
# Fit the random search model
rf_random.fit(X_train, y_train)

Fitting 3 folds for each of 50 candidates, totalling 150 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    4.1s
[Parallel(n_jobs=-1)]: Done   6 tasks      | elapsed:   11.2s
[Parallel(n_jobs=-1)]: Done  13 tasks      | elapsed:   36.7s
[Parallel(n_jobs=-1)]: Done  20 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done  29 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done  38 tasks      | elapsed:  2.9min
[Parallel(n_jobs=-1)]: Done  49 tasks      | elapsed:  3.5min
[Parallel(n_jobs=-1)]: Done  60 tasks      | elapsed:  3.7min
[Parallel(n_jobs=-1)]: Done  73 tasks      | elapsed:  5.1min
[Parallel(n_jobs=-1)]: Done  86 tasks      | elapsed:  6.2min
[Parallel(n_jobs=-1)]: Done 101 tasks      | elapsed:  6.6min
[Parallel(n_jobs=-1)]: Done 116 tasks      | elapsed:  7.7min
[Parallel(n_jobs=-1)]: Done 133 tasks      | elapsed:  9.0min
[Parallel(n_jobs=-1)]: Done 150 out of 150 | elapsed: 10.3min finished


RandomizedSearchCV(cv=3, error_score=nan,
                   estimator=RandomForestRegressor(bootstrap=True,
                                                   ccp_alpha=0.0,
                                                   criterion='mse',
                                                   max_depth=None,
                                                   max_features='auto',
                                                   max_leaf_nodes=None,
                                                   max_samples=None,
                                                   min_impurity_decrease=0.0,
                                                   min_impurity_split=None,
                                                   min_samples_leaf=1,
                                                   min_samples_split=2,
                                                   min_weight_fraction_leaf=0.0,
                                                   n_estimators=100,
                              

In [33]:
# display best parameters
rf_random.best_params_

{'n_estimators': 100,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_depth': 100,
 'bootstrap': False}

In [34]:
start = timer()
# rerun model with the parameters found by the random search
# (max_depth is set to 50 here, although best_params_ reported 100)
rf3 = RandomForestRegressor(bootstrap=False, max_depth=50, min_samples_leaf=1, min_samples_split=2, n_estimators=100, max_features='sqrt')
model22 = rf3.fit(X_train,y_train)
end = timer()
print (end-start)

46.56587911299994


In [35]:
y_pred22 =model22.predict(X_test)

In [36]:
# show model results
print('explained variance score: ',explained_variance_score(y_test, y_pred22))
print('MAE: ',mean_absolute_error(y_test, y_pred22))
print('R2 score is: ',r2_score(y_test, y_pred22))

explained variance score:  0.9299560166603671
MAE:  27460.279150475915
R2 score is:  0.9299507610207416


In [37]:
# Create the parameter grid based on the results of random search 
param_grid = {
    'bootstrap': [False],
    'max_depth': [80, 90, 100, 110],
    'max_features': [2, 3, 4],
    'min_samples_leaf': [1, 2, 10],
    'min_samples_split': [1, 2, 10, 12],  # note: scikit-learn requires an integer >= 2 here, so candidates with 1 cannot be fit
    'n_estimators': [100]
}
# Create a base model
rfg = RandomForestRegressor()
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rfg, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 10)

In [38]:
# Fit the grid search to the data
grid_search.fit(X_train, y_train)

Fitting 3 folds for each of 144 candidates, totalling 432 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.1719s.) Setting batch_size=2.
[Parallel(n_jobs=-1)]: Done   6 tasks      | elapsed:   23.7s
[Parallel(n_jobs=-1)]: Batch computation too slow (32.1368s.) Setting batch_size=1.
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:   47.7s
[Parallel(n_jobs=-1)]: Done  27 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done  41 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done  50 tasks      | elapsed:  2.7min
[Parallel(n_jobs=-1)]: Done  61 tasks      | elapsed:  3.3min
[Parallel(n_jobs=-1)]: Done  72 tasks      | elapsed:  4.0min
[Parallel(n_jobs=-1)]: Done  85 tasks      | elapsed:  5.1min
[Parallel(n_jobs=-1)]: Done  98 tasks      | elapsed:  6.0min
[Parallel(n_jobs=-1)]: Done 113 tasks      | elapsed:  7.1min
[Parallel(n_jobs=-1)]: Done 128 tasks      | elapsed:  7.8min
[Parallel(n_j

GridSearchCV(cv=3, error_score=nan,
             estimator=RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,
                                             criterion='mse', max_depth=None,
                                             max_features='auto',
                                             max_leaf_nodes=None,
                                             max_samples=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             n_estimators=100, n_jobs=None,
                                             oob_score=False, random_state=None,
                                             verbose=0, warm_start=False),
             iid='deprecated', n_jo

In [39]:
#display best params
grid_search.best_params_

{'bootstrap': False,
 'max_depth': 100,
 'max_features': 4,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'n_estimators': 100}

In [40]:
best_grid = grid_search.best_estimator_

In [41]:
y_pred23 =best_grid.predict(X_test)

In [42]:
# show model results
print('explained variance score: ',explained_variance_score(y_test, y_pred23))
print('MAE: ',mean_absolute_error(y_test, y_pred23))
print('R2 score is: ',r2_score(y_test, y_pred23))

explained variance score:  0.9298258384643732
MAE:  26645.860100417114
R2 score is:  0.9298163248889201
